Module 5 Pipeline and Vector Processing

Module 5 discusses pipelining and vector processing, detailing how pipelining decomposes processes into concurrent suboperations using registers for isolation. It addresses pipeline conflicts, data dependency solutions, and the handling of branch instructions, emphasizing the role of compilers in optimizing performance. The module also covers supercomputers and array processors, highlighting their capabilities in numerical computations and parallel processing.


Module 5:

Pipeline and Vector Processing


Dr Shreema Shetty
Dept. of CSE
SCEM
Pipelining
● Pipelining is a technique of decomposing a sequential process into
suboperations, with each subprocess being executed in a special dedicated
segment that operates concurrently with all other segments.
● The overlapping of computation is made possible by associating a register
with each segment in the pipeline.
● The registers provide isolation between each segment so that each can operate
on distinct data simultaneously.
Perhaps the simplest way of viewing the pipeline structure is to imagine that each
segment consists of an input register followed by a combinational circuit.
o The register holds the data.
o The combinational circuit performs the suboperation in the particular
segment.
A clock is applied to all registers after enough time has elapsed to perform all
segment activity.
Each segment has one or two registers and a combinational circuit, as shown in Fig. 4.1.
The five registers are loaded with new data every clock pulse. The effect of each clock is shown in Table 4-1.
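The register-plus-combinational-circuit structure above can be sketched in a few lines of Python. This is a minimal illustrative model (not from the module): two segments, each a function, with registers R1 and R2 isolating them so both operate on distinct data in the same clock.

```python
# Illustrative two-segment pipeline: each segment is a combinational
# function; input registers R1 and R2 isolate the segments.

def seg1(x):        # suboperation 1: multiply the two inputs
    a, b = x
    return a * b

def seg2(p):        # suboperation 2: add a constant
    return p + 10

def run_pipeline(stream):
    """Clock the pipeline; R1 and R2 hold each segment's current input."""
    R1 = R2 = None
    results = []
    for item in stream + [None, None]:   # extra clocks to drain the pipe
        if R2 is not None:
            results.append(seg2(R2))     # segment 2 works on older data...
        R2 = seg1(R1) if R1 is not None else None
        R1 = item                        # ...while segment 1 loads new data
    return results

print(run_pipeline([(1, 2), (3, 4), (5, 6)]))  # [12, 22, 40]
```

Note how, once the pipe is full, both segments are busy on every clock, which is the overlapping of computation the registers make possible.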


Pipeline Conflicts
In general, there are three major difficulties that cause the instruction pipeline to
deviate from its normal operation.
o Resource conflicts caused by access to memory by two segments at the
same time.
Can be resolved by using separate instruction and data memories.
o Data dependency conflicts arise when an instruction depends on the result of a
previous instruction, but this result is not yet available.
o Branch difficulties arise from branch and other instructions that change the value
of PC.
A difficulty that may cause a degradation of performance in an instruction pipeline
is due to possible collision of data or address.
o A data dependency occurs when an instruction needs data that are not yet
available.
o An address dependency may occur when an operand address cannot be
calculated because the information needed by the addressing mode is not
available.
Pipelined computers deal with such conflicts between data dependencies in a
variety of ways.
Data Dependency Solutions
Hardware interlocks: an interlock is a circuit that detects instructions whose source
operands are destinations of instructions farther up in the pipeline.
o This approach maintains the program sequence by using hardware to insert the
required delays.
Operand forwarding: uses special hardware to detect a conflict and then avoid it by
routing the data through special paths between pipeline segments.
o This method requires additional hardware paths through multiplexers as well as the
circuit that detects the conflict.
Delayed load: the compiler for such computers is designed to detect a data conflict and
reorder the instructions as necessary to delay the loading of the conflicting data by
inserting no-operation instructions.
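The operand-forwarding idea above can be sketched as follows. This is a hypothetical software model of the hardware behaviour, not an implementation from the module: before a source register is read, the pipeline checks whether an older in-flight instruction is about to write it, and if so routes that result through the special path instead of the register file.

```python
# Sketch of operand forwarding: if an older instruction in the pipeline
# is about to write a source register, forward its ALU result directly.

def read_operand(reg, regfile, in_flight):
    """in_flight maps destination register -> ALU result not yet written back."""
    if reg in in_flight:            # conflict detected: forward the result
        return in_flight[reg]
    return regfile[reg]             # no conflict: normal register read

regfile = {"R1": 0, "R2": 7}
in_flight = {"R1": 42}              # an instruction ahead in the pipe writes R1
print(read_operand("R1", regfile, in_flight))  # forwarded value: 42
print(read_operand("R2", regfile, in_flight))  # regular read: 7
```

The multiplexer mentioned in the text corresponds to the `if`/`else` choice here: one input from the register file, one from the forwarding path.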
Handling of Branch Instructions

One of the major problems in operating an instruction pipeline is the occurrence of branch instructions.
o An unconditional branch always alters the sequential program flow by loading the program counter with the target address.
o In a conditional branch, the control selects the target instruction if the condition is satisfied or the next sequential instruction if the condition is not satisfied.
Pipelined computers employ various hardware techniques to minimize the performance degradation caused by instruction branching.


Prefetch target instruction: the target instruction is prefetched in addition to the instruction following the branch; both are saved until the branch is executed.
Branch target buffer (BTB): the BTB is an associative memory included in the fetch segment of the pipeline.
o Each entry in the BTB consists of the address of a previously executed
branch instruction and the target instruction for that branch.
o It also stores the next few instructions after the branch target instruction.
Loop buffer: a small, very-high-speed register file maintained by the instruction fetch segment of the pipeline.
Branch prediction: A pipeline with branch prediction uses some additional logic to guess
the outcome of a conditional branch instruction before it is executed.
Delayed branch: in this procedure, the compiler detects the branch instructions and
rearranges the machine language code sequence by inserting useful instructions that
keep the pipeline operating without interruptions.
o This procedure is employed in most RISC processors.
o If no useful instruction can be found, a no-operation instruction is inserted.
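One common way the additional logic for branch prediction is organized is a 2-bit saturating counter per branch. The module does not specify a scheme, so the following is an illustrative sketch of that widely used approach: predict taken when the counter is 2 or 3, and nudge the counter toward the actual outcome after each execution.

```python
# Sketch of branch prediction with a 2-bit saturating counter per branch:
# states 0-1 predict not taken, states 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}          # branch address -> state 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2   # True = predict taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

p = TwoBitPredictor()
outcomes = [True, True, False, True]   # a loop branch: mostly taken
hits = 0
for taken in outcomes:
    hits += p.predict(0x100) == taken  # count correct guesses
    p.update(0x100, taken)
print(hits)
```

The two-bit state gives the predictor hysteresis: a single mispredicted loop exit does not flip the prediction for the next iteration of the loop.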
RISC Pipeline
The RISC architecture can employ a very efficient instruction pipeline:
o The instruction pipeline is implemented using a small number of suboperations, with each being executed in one clock cycle.
o Because of the fixed-length instruction format, the decoding of the operation can occur
at the same time as the register selection.
o Therefore, the instruction pipeline can be implemented with two or three segments.
One segment fetches the instruction from program memory.
Another segment executes the instruction in the ALU.
A third segment may be used to store the result of the ALU operation in a destination register.
The data transfer instructions in RISC are limited to load and store instructions.
o These instructions use register indirect addressing. They usually need three or four stages in the pipeline.
o To prevent conflicts between a memory access to fetch an instruction and to load or store an operand, most RISC machines use two separate buses with two memories.
o Cache memory: operates at the same speed as the CPU clock.


One of the major advantages of RISC is its ability to execute instructions at the rate of one per clock cycle.
o In effect, the goal is to start a new instruction with each clock cycle and to pipeline the processor so as to achieve single-cycle instruction execution.
o RISC can achieve pipeline segments that each require just one clock cycle.
Compiler support: instead of designing hardware to handle the difficulties associated with data conflicts and branch penalties, RISC processors rely on the efficiency of the compiler that translates the high-level language program into machine language to detect and minimize the delays these problems cause.


Example: Three-Segment Instruction Pipeline
There are three types of instructions:
o The data manipulation instructions: operate on data in processor registers
o The data transfer instructions: load and store data between memory and processor registers
o The program control instructions: change the value of the program counter (branches)
The control section fetches the instruction from program memory into an instruction
register.
o The instruction is decoded at the same time that the registers needed for the execution
of the instruction are selected.
The processor unit consists of a number of registers and an arithmetic logic unit (ALU).
A data memory is used to load or store the data from a selected register in the register file.
The instruction cycle can be divided into three suboperations and implemented in three segments:
o I: Instruction fetch
Fetches the instruction from program memory
o A: ALU operation
The instruction is decoded and an ALU operation is performed.
It performs an operation for a data manipulation instruction.
It evaluates the effective address for a load or store instruction.
It calculates the branch address for a program control instruction.
o E: Execute instruction
Directs the output of the ALU to one of three destinations, depending on the
decoded instruction.
It transfers the result of the ALU operation into a destination register in the register
file.
It transfers the effective address to a data memory for loading or storing.
It transfers the branch address to the program counter.
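The timing of the I, A, and E segments can be tabulated: while instruction n is in E, instruction n+1 is in A and instruction n+2 is in I. The sketch below (illustrative, with the segment names taken from the module) builds that table for an ideal conflict-free sequence.

```python
# Three-segment pipeline timing: instruction k enters I at clock k,
# A at clock k+1, and E at clock k+2 (no conflicts assumed).

def schedule(n_instr):
    """Return {clock: (I, A, E)} showing which instruction occupies each segment."""
    table = {}
    for clock in range(1, n_instr + 3):          # k + n - 1 clocks for k = 3 segments
        in_I = clock if clock <= n_instr else None
        in_A = clock - 1 if 1 <= clock - 1 <= n_instr else None
        in_E = clock - 2 if 1 <= clock - 2 <= n_instr else None
        table[clock] = (in_I, in_A, in_E)
    return table

for clk, (i, a, e) in schedule(4).items():
    print(clk, i, a, e)
```

Four instructions complete in 4 + 3 - 1 = 6 clocks; after the first two fill clocks, one instruction finishes on every clock.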
● It is up to the compiler to make sure that the instruction following the load instruction uses the data fetched from memory.
● This concept of delaying the use of the data loaded from memory is referred to as delayed load.
● Thus the no-op instruction is used to advance one clock cycle in order to compensate for the data conflict in the pipeline.
● The advantage of the delayed load approach is that the data dependency is taken care of by the compiler rather than the hardware.
Delayed Branch
The method used in most RISC processors is to rely on the compiler to redefine
the branches so that they take effect at the proper time in the pipeline. This
method is referred to as delayed branch.
The compiler is designed to analyze the instructions before and after the branch
and rearrange the program sequence by inserting useful instructions in the delay
steps.
It is up to the compiler to find useful instructions to put after the branch
instruction. Failing that, the compiler can insert no-op instructions.
An Example of Delayed Branch
The program for this example consists of five instructions.
o Load from memory to R1
o Increment R2
o Add R3 to R4
o Subtract R5 from R6
o Branch to address X
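The rearrangement the compiler performs for one delay slot can be sketched as below. This is a hypothetical pass, not the module's worked solution: it scans backward from the branch for an instruction the branch does not depend on and moves it into the slot after the branch, falling back to a no-op if nothing qualifies.

```python
# Sketch of delay-slot filling: move one instruction that the branch
# does not depend on into the slot after the branch.

def fill_delay_slot(program, branch_index, branch_deps):
    """program: list of (text, registers_touched); branch_deps: registers the
    branch needs. Returns a sequence with the delay slot filled."""
    prog = list(program)
    for i in range(branch_index - 1, -1, -1):
        if not (set(prog[i][1]) & branch_deps):  # independent of the branch
            instr = prog.pop(i)
            prog.append(instr)                   # place it in the delay slot
            return prog
    return prog + [("NOP", ())]                  # nothing useful: insert no-op

# The five-instruction example above, with an unconditional branch.
prog = [("LOAD R1", ("R1",)),
        ("INC R2", ("R2",)),
        ("ADD R3,R4", ("R3", "R4")),
        ("SUB R6,R5", ("R5", "R6")),
        ("BRANCH X", ())]
for instr in fill_delay_slot(prog, 4, set()):
    print(instr[0])
```

Here the subtract is hoisted into the delay slot, so the pipeline executes it while the branch target is being fetched and no clock cycle is wasted.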
To achieve the required level of high performance it is necessary to utilize the fastest and most reliable hardware and apply innovative procedures from vector and parallel processing techniques.


Supercomputers
A commercial computer with vector instructions and pipelined floating-point
arithmetic operations is referred to as a supercomputer.
o To speed up the operation, the components are packed tightly together to
minimize the distance that the electronic signals have to travel.
A supercomputer is a computer system best known for its high computational speed, fast and large memory systems, and the extensive use of parallel processing.
o Its instruction set is augmented by instructions that process vectors and combinations of scalars and vectors.
o It is equipped with multiple functional units and each unit has its own pipeline
configuration.
It is specifically optimized for the type of numerical calculations involving vectors and matrices of floating-point numbers.
Supercomputers are limited in their use to a number of scientific applications, such as numerical weather forecasting, seismic wave analysis, and space research.
A measure used to evaluate computers in their ability to perform a given number of floating-point operations per second is referred to as flops.

A typical supercomputer has a basic cycle time of 4 to 20 ns.
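The cycle time quoted above translates directly into a flops figure. The following is a rough illustrative calculation (the specific numbers are assumptions, not from the module): a pipelined functional unit that delivers one floating-point result per 10 ns clock sustains 100 megaflops.

```python
# Illustrative flops calculation for a pipelined functional unit that
# produces one floating-point result per clock once the pipe is full.

cycle_time_ns = 10                    # within the 4-20 ns range quoted above
results_per_cycle = 1                 # one result per clock (assumed)
flops = results_per_cycle / (cycle_time_ns * 1e-9)
print(flops / 1e6)                    # megaflops: 100.0
```

Multiple functional units with their own pipelines, as described above, multiply this figure accordingly.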

Examples of supercomputers:
o Cray-1: uses vector processing with 12 distinct functional units in parallel; a large number of registers (over 150); multiprocessor configurations (Cray X-MP and Cray Y-MP)
o Fujitsu VP-200: 83 vector instructions and 195 scalar instructions; 300 megaflops
Array Processing
An array processor is a processor that performs computations on large arrays of
data.
The term is used to refer to two different types of processors.
o Attached array processor: an auxiliary processor intended to improve the performance of the host computer in specific numerical computation tasks.
o SIMD array processor: has a single-instruction, multiple-data organization. It manipulates vector instructions by means of multiple functional units responding to a common instruction.
Attached Array Processor

Its purpose is to enhance the performance of the computer by providing vector processing for complex scientific applications.
o Parallel processing with multiple functional units
Fig. 4-14 shows the interconnection of an attached array processor to a host computer.
For example, when attached to a VAX 11 computer, the FPS-164/MAX from Floating-Point Systems increases the computing power of the VAX to 100 megaflops.
The objective of the attached array processor is to provide vector manipulation capabilities to a conventional computer at a fraction of the cost of a supercomputer.
Fig 4-14: Attached array processor with host computer
SIMD Array Processor
An SIMD array processor is a computer with multiple processing units operating
in parallel.
A general block diagram of an array processor is shown in Fig. 4-15.
o It contains a set of identical processing elements (PEs), each having a
local memory M.
o Each PE includes an ALU, a floating-point arithmetic unit, and working
registers.
o Vector instructions are broadcast to all PEs simultaneously.
Masking schemes are used to control the status of each PE during the execution of
vector instructions.
o Each PE has a flag that is set when the PE is active and reset when the PE is inactive.

An example is the ILLIAC IV computer, developed at the University of Illinois and manufactured by the Burroughs Corp.
o SIMD array processors are highly specialized computers.
o They are suited primarily for numerical problems that can be expressed in vector or matrix form.
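The masking scheme described above can be sketched as follows. This is an illustrative model, not any specific machine's organization: one vector instruction is broadcast to all PEs, each with its own local memory M, but only PEs whose flag is set execute it.

```python
# Sketch of SIMD execution with masking: a broadcast vector add is
# performed only by PEs whose mask flag is set.

def broadcast_add(local_mem, addr_a, addr_b, addr_c, mask):
    """Each active PE computes M[c] = M[a] + M[b] in its own local memory."""
    for pe, mem in enumerate(local_mem):
        if mask[pe]:                       # flag set -> PE is active
            mem[addr_c] = mem[addr_a] + mem[addr_b]

# Four PEs, each with a small local memory; PE 2 is masked off.
mems = [[1, 10, 0], [2, 20, 0], [3, 30, 0], [4, 40, 0]]
broadcast_add(mems, 0, 1, 2, mask=[True, True, False, True])
print([m[2] for m in mems])    # PE 2's result location stays 0
```

All active PEs respond to the single common instruction simultaneously, which is exactly the single-instruction, multiple-data organization defined above.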
