Module 1: PARALLEL AND DISTRIBUTED COMPUTING

The document discusses parallelism and its importance in improving performance. It covers key concepts such as what parallelism is, why it is needed given hardware and software limitations, and how parallel computing can help solve challenges in fields like healthcare and environmental science. It also discusses the evolution of parallel architectures from single-core to multi-core processors, trends such as distributed computing across multiple nodes, the goals of parallelism such as scalability and performance, and the challenges in utilizing parallelism effectively.

Module 1: Parallelism

Fundamentals
Motivation
• With respect to both hardware and software:
– Road blocks to improving performance with serial code / single uni-core systems
– Hardware has to support parallelism for improved performance
– Software has to be able to offer parallelism (in terms of code, compilers etc.)
Key Concepts
What is parallelism?
• Parallelism is the process of executing multiple instructions on the same data, a single instruction on multiple data, or a combination of both
• Are parallelism and concurrency the same?
– Not exactly: concurrency does not need redundant hardware units (tasks may simply interleave), while parallel computing involves redundant structures executing simultaneously
Why Parallelism?
• It is the obvious path for the continued evolution of high performance
• A lesson learned from applications:
– Concurrency and parallelism are key to future systems
Why Parallel, Concurrent and Distributed Computing?
• Parallel computing can transform science and engineering
• Example: Cosmology – the study of the universe, its evolution and structure – where one of the most striking paradigm shifts has occurred
– Tremendously detailed new observations deep into the universe are available from instruments such as the Hubble Space Telescope and the Digital Sky Survey
– However, until recently it has been difficult, except in relatively simple circumstances, to use mathematical theories of the early universe for comparison with observations
– Scalable parallel computers with large memories have changed all of that
To port or not to port?
• Simply porting serial code should not suffice
• Writing parallel applications for parallel architectures should involve reformulating the data structures, the basic code and the dynamics involved
Parallel supercomputing can answer challenges to society
• Parallel computing is used not only for particle studies but also for human-related data:
– Health care
– Weather / environmental applications
The burden is not on the ‘Hardware’ alone
• Even with existing architectures, there are many ways parallelism can be exploited to improve efficiency in terms of execution time, power and so on
Overview of Parallel Computing, Architectural Demands and Trends
Evolution of Parallel architecture
• Stored Program Concept
• ILP (Pipelining) (SISD)
• TLP
• OoO (Out-of-Order Processors)
• Vector Processors (SIMD)
• Simultaneous Multi-Threading (SMT)
• Multi-core Processors (MIMD)
• SIMT Architectures (GPU)
• Multi-node systems (with multiple cores, GPUs etc.) – Distributed

Stored Program Computers
CPU Fundamentals
• Primary function is to execute instructions
• Program Execution Steps
– CPU transfers instructions and input data from main memory to registers in the CPU
– CPU executes instructions in their stored sequence (unless altered explicitly)
– CPU transfers output data from CPU registers to main memory
Program Execution
Single Core Processors
Single-core computer
Un-pipelined Data Path
Single Processor Core
• Two parts: Control Unit and Data Path
• Data Path
– Performs the arithmetic/logic operations and data movement
• Control Unit
– Unit to control the Data Path
Pipelining
Pipelined Data-path
• Pipeline registers are included between stages to provide ILP (hardware support)
Pipeline Hazards
• Structural Hazards
• Data Hazards
• Control Hazards
Pipeline Hazards & Solutions
• Structural Hazards
– Redundancy
• Data Hazards
– Forwarding, Loop Unrolling
• Control Hazards
– Branch Prediction
Constraints in In-order Execution

Data Hazard:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
SUB.D F1,F3,F5
MUL.D F6,F10,F8

Structural Hazard:
DIV.D F0,F2,F4
DIV.D F1,F3,F5
ADD.D F7,F9,F11
SUB.D F8,F10,F14
MUL.D F6,F10,F8
…Continued
Control Hazard
Beq F0,F2, S1
DIV.D F1,F3,F5
ADD.D F6,F0,F8
Jmp S2
S1: SUB.D F8,F10,F14
S2: MUL.D F7,F11,F9
Pipelined Processor with Out-of-Order Execution
• The fundamental problem we face when trying to keep four functional units busy is that it is difficult to find contiguous sets of instructions that can be executed in parallel
• The solution to these problems is out-of-order execution and speculative execution
• Support is needed from both hardware and software:
– Hardware that allows instructions to execute out of order
– Software that arranges totally independent instructions so they can be fetched
Drawback of In-Order Execution
• A major limitation of in-order execution is
– Stalling the pipeline until all previous instructions have issued, even for instructions that have no data dependency on the stalled ones
• Eg:
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
– Here ADD.D must wait for the long-latency DIV.D (it needs F0), and SUB.D, although independent, is also stalled behind it
The Idea of Out-of-Order Execution
• In the previous example, to issue SUB.D we must separate issue into two parts:
– Checking for any structural hazard
– Waiting for the absence of a data hazard
• We still use in-order instruction issue, but we want an instruction to begin execution as soon as its operands are available, which implies out-of-order completion
• Eg:
DIV F0, F1, F2
MUL F4, F2, F3
ADD F5, F0, F4
SUB F5, F2, F1
ShiftL F1, F6, F7
Continued…
• Out-of-order execution introduces the possibility of WAR and WAW hazards, which do not exist in the 5-stage pipeline
• Eg:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
SUB.D F8,F10,F14
MUL.D F6,F10,F8
– WAR Hazard between ADD.D and SUB.D
– WAW Hazard between ADD.D and MUL.D
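As a rough illustration of how dynamic renaming removes these false dependences, here is a minimal C-like sketch (not from the slides; the _a/_b variables stand in for hypothetical fresh physical registers, and the sample values are arbitrary):

#include <stdio.h>

/* Sketch of register renaming: f0, f8, f10, f14 stand in for the
   original architectural registers of the example above. */
int main(void) {
    double f0 = 2.0, f8 = 3.0, f10 = 5.0, f14 = 1.0;   /* sample values */

    double f6_a = f0 + f8;     /* ADD.D F6,F0,F8  : first write to F6 */
    double f8_b = f10 - f14;   /* SUB.D F8,F10,F14: renamed to a fresh F8',
                                  removing the WAR with ADD.D's read of F8 */
    double f6_b = f10 * f8_b;  /* MUL.D F6,F10,F8 : renamed to a fresh F6',
                                  removing the WAW with the first write to F6 */

    printf("%f %f %f\n", f6_a, f8_b, f6_b);
    return 0;
}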
Out-of-Order Completion Complexities
• Exception behaviour must be preserved, and imprecise exceptions should not arise
• Imprecise exceptions can occur because of two possibilities:
1. The pipeline may have already completed instructions that are later in program order than the instruction causing the exception
2. The pipeline may not have completed some instructions that are earlier in program order than the instruction causing the exception
Out-of-Order Execution
• It requires multiple functional units, pipelined functional units, or both
• In a dynamically scheduled pipeline, all instructions pass through the issue stage in order
• However, they can be stalled or bypass each other in the read-operand stage
• Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences
• Tomasulo’s Algorithm handles anti-dependences and output dependences by effectively renaming the registers dynamically
Evolution of Multi-Core
Era before Multi-Core
– Improved hardware technologies resulted in:
• Increased clock frequency
• Increased transistor density
• Exploiting ILP
What is Instruction Level Parallelism?
• Overlapping the execution of multiple independent instructions from a single instruction stream (e.g. via pipelining and multiple functional units)
Power Wall
• We can put more transistors on a chip than we can afford to turn on
Frequency Wall
• Dynamic power in a chip is proportional to V²·f·C
• So increasing ‘f’ (and the supply voltage needed to sustain it) leads to the power wall, as sketched below
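A minimal numeric sketch of the V²·f·C relation (the constants and voltages are illustrative assumptions, not from the slides), showing why two slower cores at lower voltage can draw less dynamic power than one fast core:

#include <stdio.h>

/* Illustrative dynamic-power model P = C * V^2 * f (values are assumptions) */
int main(void) {
    double C = 1.0;                          /* normalized switched capacitance */
    double base = C * 1.0 * 1.0 * 4.0e9;     /* one core: 1.0 V at 4 GHz */

    /* two cores at 2 GHz; assume the lower frequency permits ~0.8 V */
    double two_core = 2.0 * (C * 0.8 * 0.8 * 2.0e9);

    printf("relative dynamic power: %.2f\n", two_core / base);   /* ~0.64 */
    return 0;
}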
Memory Wall
• Processor speed has grown much faster than main-memory speed, so memory access latency increasingly limits performance
What’s the Solution?
• Because of the walls discussed above, performance can no longer be improved just by exploiting ILP or by increasing frequency
• Improved thread-level parallelism will provide the solution
• Software-level threads were developed to increase the number of threads
Software Solution to Improve ILP
Concurrent Execution Model
• More complex computing systems allow a user to run multiple applications that execute at the same time
Sequential Execution
• Conventional programs are called sequential because the programmer imagines a computer executing the code statement by statement
• At any instant, the machine is executing exactly one statement
Thread Level Parallelism – Increasing Hardware Threads
How does TLP overcome the barriers?
• Overcoming the Power wall & Frequency wall
– A single processor operating at 4 GHz can be replaced by multiple processors each operating at 2 GHz
• Overcoming the Memory Wall
– Multithreading means cycle-by-cycle interleaving of instructions from different threads
– If one thread is busy waiting on memory, that latency can be masked by executing other threads (see the thread sketch below)
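A minimal sketch of thread-level parallelism in C using POSIX threads (the array contents, thread count and helper names are assumptions, not from the slides):

#include <pthread.h>
#include <stdio.h>

#define N 100
#define NTHREADS 2

static double A[N];
static double partial[NTHREADS];

/* Each thread sums its own contiguous chunk of A */
static void *sum_part(void *arg) {
    int t = *(int *)arg;
    int chunk = N / NTHREADS;
    double s = 0.0;
    for (int i = t * chunk; i < (t + 1) * chunk; i++)
        s += A[i];
    partial[t] = s;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < N; i++) A[i] = 1.0;      /* sample data */

    for (int t = 0; t < NTHREADS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, sum_part, &ids[t]);
    }
    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("sum = %f\n", total);
    return 0;
}

While one thread stalls on memory, the hardware can run instructions from the other thread, which is the latency-masking idea described above.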
Why not Multi-Processors?
• Communication latency between processors on different boards is high
• This becomes a bottleneck for performance
What is Multi-Core?
Moore’s law : Alive & Well
Gordon Moore (co-founder of Intel) predicted
in 1965 that the transistor density of
semiconductor chips would double roughly
every 18 months.
Comparison of different architectures
• Single core
• Multiprocessor
• Multi-core processor
• Multi-core processor with shared cache
• Multi-core processor with distributed cache
Goals of Parallelism
What are the goals of parallelism?
• To make the architectural design scalable
• To have improved performance
• To balance cost, performance and reliability
Have these goals been achieved with multi-core?
• To a certain extent
• Applications are often unable to exploit the parallelism available in multi-core processors
• This has to be addressed
Challenges
• To utilize the existing parallel architectures thoroughly
– Skilled people to write parallel programs
– Tools and libraries for parallel programming support
• To scale the existing systems to perform better in the future
• Architectural models to cater to fast and big-data computing
Future Architectures
• Future architectures will place emphasis not only on the processors but also on
– I/O devices
– Memory
– Interconnection among nodes
Communication and Co-ordination
INTERCONNECTION NETWORKS FOR PARALLEL COMPUTERS
• Physically Shared Memory
• Distributed Memory
Network Topologies
• Direct Networks
– Direct networks consist of physical interconnection links that connect the nodes (typically PEs) in a parallel computer
– Each node may need a router to make routing decisions
– Examples: Ring, Mesh
Continued…
• Mesh, Torus
Continued…
Hypercube network
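As an illustrative sketch (the node numbering is an assumption, not from the slides): in a d-dimensional hypercube there are 2^d nodes, and each node links to the d nodes whose binary IDs differ from its own in exactly one bit.

#include <stdio.h>

/* Print each node's neighbours in a 3-dimensional (8-node) hypercube */
int main(void) {
    int d = 3;
    for (int node = 0; node < (1 << d); node++) {
        printf("node %d ->", node);
        for (int k = 0; k < d; k++)
            printf(" %d", node ^ (1 << k));   /* flip bit k to get a neighbour */
        printf("\n");
    }
    return 0;
}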
Continued…
• Indirect Networks
– In indirect networks, each processing node is connected to a network of switches over one or more (often bidirectional) links
– Typically, this network consists of one or more stages of switch boxes
#include <stdio.h>
double A[100];                          /* input array, assumed initialized elsewhere */
int main(void) {
    double arraySum = 7;
    for (int i = 0; i < 100; i++) {
        arraySum += A[i];               /* accumulate A[i] into the running sum */
        printf("%f\n", arraySum);       /* print the partial sum */
    }
    return 0;
}
/* Same loop unrolled by a factor of 3: fewer loop-control instructions and
   more work exposed per iteration (each add still costs ~4 clk) */
arraySum = 7;
int i;
for (i = 0; i + 2 < 100; i += 3) {
    arraySum += A[i];
    arraySum += A[i + 1];
    arraySum += A[i + 2];
}
for (; i < 100; i++)          /* clean-up for the remaining iteration(s) */
    arraySum += A[i];
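A further hedged variant (not from the slides, reusing the same array A): separate accumulators break the single dependence chain on arraySum, so the pipeline can overlap the adds.

/* Hypothetical variant using independent accumulators s0, s1, s2 */
double s0 = 7.0, s1 = 0.0, s2 = 0.0;
int j;
for (j = 0; j + 2 < 100; j += 3) {
    s0 += A[j];        /* the three adds are independent of each other, */
    s1 += A[j + 1];    /* so they can be in flight at the same time     */
    s2 += A[j + 2];
}
for (; j < 100; j++)   /* clean-up */
    s0 += A[j];
double unrolledSum = s0 + s1 + s2;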
loop3:
l.d   $f10, 0($5)        ; $f10 ← A[i]
add.d $f8, $f8, $f10     ; $f8 ← $f8 + A[i]
addi  $5, $5, 8          ; advance pointer to the next A[i] (8-byte doubles)
addi  $7, $7, -1         ; decrement loop count
test:
bgtz  $7, loop3          ; continue if count > 0
