
Parallel Computing Platforms

Lecture (2)
Chapter 2. Parallel Programming
Platforms
Topic Overview

• Implicit Parallelism: Trends in Microprocessor Architectures


• Limitations of Memory System Performance
• Dichotomy of Parallel Computing Platforms
• Communication Model of Parallel Platforms
• Physical Organization of Parallel Platforms
• Communication Costs in Parallel Machines
• Messaging Cost Models and Routing Mechanisms
• Mapping Techniques
• Case Studies
Scope of Parallelism

• Conventional architectures coarsely comprise a processor, a memory system, and a datapath.
• Each of these components presents significant performance bottlenecks.
• Parallelism addresses each of these components in significant ways.
• Different applications utilize different aspects of parallelism - e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance.
• It is important to understand each of these performance bottlenecks.
Implicit Parallelism: Trends in
Microprocessor Architectures
• Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
• Processor types include microprocessors, microcontrollers, embedded processors, and digital signal processors; the appropriate type varies with the target device.
• CPUs are also classified by core count: single-core, dual-core, quad-core, hexa-core, octa-core, and deca-core processors.
• Higher levels of device integration have made available a large number of transistors.
• The question of how best to utilize these resources is an important one.
• Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
Pipelining and Superscalar Execution

• Pipelining overlaps various stages of instruction execution to improve performance.
• At a high level of abstraction, an instruction can be executed while the next one is being decoded and the next one is being fetched.
• This is akin to an assembly line for the manufacture of cars.
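As a rough illustration (the depth, clock period, and instruction count below are assumed for this sketch, not taken from the slide), the steady-state gain of an ideal pipeline can be worked out directly:

    #include <stdio.h>

    int main(void) {
        const int stages = 4;          /* assumed pipeline depth           */
        const double cycle_ns = 1.0;   /* assumed clock period             */
        const int n = 1000;            /* instructions to execute          */

        /* Without pipelining, each instruction occupies all stages in turn. */
        double unpipelined = (double)n * stages * cycle_ns;   /* 4000 ns   */
        /* With pipelining, one instruction completes per cycle once the
         * pipeline is full (fill time: stages - 1 cycles).                */
        double pipelined = (stages + n - 1) * cycle_ns;       /* 1003 ns   */

        printf("speedup = %.2f\n", unpipelined / pipelined);  /* ~3.99x    */
        return 0;
    }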
Pipelining and Superscalar Execution

• Pipelining, however, has several limitations.
• The speed of a pipeline is eventually limited by the slowest stage.
• For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
• However, in typical program traces, every 5th-6th instruction is a conditional jump!
• The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.
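A minimal sketch of this cost, with assumed numbers: a conditional jump every 5th instruction (per the slide), a 10% misprediction rate, and a 20-stage pipeline.

    /* Rough extra cycles-per-instruction (CPI) lost to branch
     * mispredictions; the misprediction rate is an assumption. */
    double branch_cpi_penalty(void) {
        const double branch_freq = 1.0 / 5.0;  /* every 5th instruction    */
        const double mispredict_rate = 0.10;   /* assumed                  */
        const double flush_cost = 20.0;        /* cycles flushed per miss,
                                                  i.e. the pipeline depth  */
        return branch_freq * mispredict_rate * flush_cost;  /* 0.4 extra CPI */
    }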
Pipelining and Superscalar Execution

• One simple way of alleviating these bottlenecks is to use multiple pipelines.
• The question then becomes one of selecting which instructions to issue into these pipelines in the same cycle.
Superscalar Execution: An Example
Superscalar Execution

• Scheduling of instructions is determined by a number of factors (illustrated in the sketch below):
– True Data Dependency: The result of one operation is an input to the next.
– Resource Dependency: Two operations require the same resource.
– Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
– The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
– The complexity of this hardware is an important constraint on superscalar processors.
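A hypothetical instruction sequence showing all three factors; the variable names are invented for the sketch.

    void schedule_example(double x, double y, double z,
                          double p, double q, double *out) {
        double t = x * y;   /* i1 */
        double u = t + z;   /* i2: true data dependency - needs i1's result  */
        double v = p * q;   /* i3: resource dependency with i1 if the
                               processor has only one FP multiplier          */
        if (u > 0.0)        /* i4: branch - instructions beyond this point
                               cannot be scheduled deterministically a priori */
            v = -v;
        *out = u + v;
    }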
Superscalar Execution:
Issue Mechanisms
• In the simpler model, instructions can be issued only in the order in which
they are encountered. That is, if the second instruction cannot be issued
because it has a data dependency with the first, only one instruction is issued
in the cycle. This is called in-order issue.
• In a more aggressive model, instructions can be issued out of order. In this
case, if the second instruction has data dependencies with the first, but the
third instruction does not, the first and third instructions can be co-scheduled.
This is also called dynamic issue.
• Performance of in-order issue is generally limited.
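The difference can be seen on a three-instruction sequence (variable names invented for the sketch):

    void issue_example(double a, double b, double c, double d,
                       double *r1, double *r2) {
        double t = a * b;   /* i1                                */
        *r1 = t + c;        /* i2: depends on i1's result        */
        *r2 = c * d;        /* i3: independent of i1 and i2      */
    }
    /* In-order issue: i2 blocks on i1, so only i1 issues that cycle
     * and i3 waits behind it. Dynamic (out-of-order) issue: i1 and
     * i3 can be co-scheduled in the same cycle.                    */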
Superscalar Execution:
Efficiency Considerations
• Not all functional units can be kept busy at all times.
• If during a cycle, no functional units are utilized, this is referred to as vertical
waste.
• If during a cycle, only some of the functional units are utilized, this is referred
to as horizontal waste.
• Due to limited parallelism in typical instruction traces, or the inability of the
scheduler to extract parallelism, the performance of superscalar processors
is eventually limited.
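A small accounting sketch, with assumed numbers: a 4-issue processor observed over 100 cycles.

    double utilization(void) {
        const int width = 4, cycles = 100;
        const int idle_cycles = 20;     /* vertical waste: cycles in which
                                           no functional unit issues        */
        const int busy_slots = 2 * (cycles - idle_cycles);
                                        /* horizontal waste: only 2 of the
                                           4 slots used in each busy cycle  */
        return (double)busy_slots / (width * cycles);   /* 160/400 = 0.40 */
    }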
Very Long Instruction Word (VLIW) Processors

• The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
• To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
• These instructions are packed and dispatched together, and thus the name very long instruction word.
• This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA64 processors.
• Typical VLIW processors are limited to 4-way to 8-way parallelism.
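A conceptual sketch of a 4-way VLIW instruction word; the slot mix and widths are invented for illustration, and real encodings (e.g., IA64 bundles) differ.

    #include <stdint.h>

    typedef struct {
        uint32_t int_op;     /* integer ALU slot      */
        uint32_t fp_op;      /* floating-point slot   */
        uint32_t mem_op;     /* load/store slot       */
        uint32_t branch_op;  /* branch slot           */
    } vliw_bundle;           /* the compiler, not hardware, guarantees
                                the four operations are independent   */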
Limitations of
Memory System Performance
• The memory system, and not processor speed, is often the bottleneck for many applications.
• Memory system performance is largely measured by two parameters, latency
and bandwidth.
• Latency is the time from the issue of a memory request to the time the data is
available at the processor.
• Bandwidth is the rate at which data can be pumped to the processor by the
memory system.
Memory System Performance: Bandwidth and Latency

• It is very important to understand the difference between latency and bandwidth.
• Consider the example of a fire-hose. If the water comes out of the hose two
seconds after the hydrant is turned on, the latency of the system is two
seconds.
• Once the water starts flowing, if the hydrant delivers water at the rate of 5
gallons/second, the bandwidth of the system is 5 gallons/second.
• If you want immediate response from the hydrant, it is important to reduce
latency.
• If you want to fight big fires, you want high bandwidth.
Memory Latency: An Example

• Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations follow:
– The peak processor rating is 4 GFLOPS (floating point operations per second, a measure of computer performance useful in fields of scientific computation).
– Since the memory latency is equal to 100 cycles, every time a memory request is made the processor must wait 100 cycles before it can process the data.
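The slide's numbers, worked out directly:

    #include <stdio.h>

    int main(void) {
        const double clock_ghz = 1.0;        /* 1 ns cycle             */
        const int flops_per_cycle = 4;       /* two multiply-add units */
        const double dram_latency_ns = 100.0;

        printf("peak rating: %.0f GFLOPS\n", clock_ghz * flops_per_cycle);
        printf("stall per request: %.0f cycles\n",
               dram_latency_ns * clock_ghz);  /* 100 cycles */
        return 0;
    }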
Memory Latency: An Example

• On the above architecture, consider the problem of computing a dot-product of two vectors.
– A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch.
– It follows that the peak speed of this computation is limited to one floating point operation every 100 ns, a very small fraction of the peak processor rating!
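A minimal sketch of the computation in question: each iteration fetches one pair of elements and performs one multiply-add, so with no cache the loop is memory bound at roughly one floating point operation per 100 ns (about 10 MFLOPS).

    #include <stddef.h>

    double dot(const double *a, const double *b, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];   /* one multiply-add per pair fetched */
        return s;
    }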
Improving Effective Memory
Latency Using Caches
• Caches are small and fast memory elements between the processor and
DRAM.
• This memory acts as a low-latency high-bandwidth storage.
• If a piece of data is repeatedly used, the effective latency of this memory
system can be reduced by the cache.
• The fraction of data references satisfied by the cache is called the cache hit
ratio of the computation on the system.
• The cache hit ratio achieved by a code on a memory system often determines its performance.
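This can be made concrete with the standard effective-latency model (not stated on the slide): t_eff = h * t_cache + (1 - h) * t_dram, where h is the hit ratio, using the latencies from the running example.

    double effective_latency_ns(double hit_ratio) {
        const double t_cache = 1.0;    /* ns, on a cache hit  */
        const double t_dram = 100.0;   /* ns, on a cache miss */
        return hit_ratio * t_cache + (1.0 - hit_ratio) * t_dram;
    }
    /* A 90% hit ratio gives 0.9 * 1 + 0.1 * 100 = 10.9 ns on average. */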
Impact of Caches: Example

• Consider the architecture from the previous example: a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns, with two multiply-add units capable of executing four instructions in each cycle of 1 ns.
• In this case, we introduce a cache of size 32 KB with a latency of 1 ns, or one cycle. We use this setup to multiply two matrices A and B of dimensions 32 × 32. We have carefully chosen these numbers so that the cache is large enough to store matrices A and B, as well as the result matrix C.
Impact of Caches: Example (continued)

• The following observations can be made about the problem:
– Fetching the two matrices into the cache corresponds to fetching 2K words (at 100 ns per word), which takes approximately 200 µs.
– Multiplying two n × n matrices takes 2n³ operations. For our problem, this corresponds to 64K operations, which can be performed in 16K cycles (or 16 µs) at four instructions per cycle.
– The total time for the computation is therefore approximately the sum of the time for load/store operations and the time for the computation itself, i.e., 200 + 16 µs.
– This corresponds to a peak computation rate of 64K/216 µs, or approximately 303 MFLOPS.
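A minimal sketch of the computation being timed; the cache behavior noted in the comments follows the slide's numbers.

    #define N 32   /* three 32 x 32 double matrices (24 KB) fit in the
                      32 KB cache                                       */

    void matmul(const double A[N][N], const double B[N][N],
                double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)   /* 2 * N^3 = 64K operations */
                    s += A[i][k] * B[k][j];   /* operands hit in cache
                                                 after the initial fetch  */
                C[i][j] = s;
            }
    }
    /* Fetch A and B: ~200 us; compute: 64K ops / 4 per cycle = 16K
     * cycles = 16 us; total ~216 us, hence 64K/216 us ~= 303 MFLOPS. */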
