Chapter_4

Chapter 4 discusses the organization and functioning of processors, detailing the steps involved in instruction execution including fetching, interpreting, processing, and writing data. It covers the internal structure of the CPU, including registers, the instruction cycle, and pipelining strategies to enhance performance. Additionally, it addresses pipeline hazards and the concepts of scalar, vector, and superscalar processors, emphasizing the importance of instruction-level parallelism.


Chapter 4

The Processor

+
+ Processor Organization
Processor Requirements:
■ Fetch instruction
■ The processor reads an instruction from memory (register, cache, main memory)

■ Interpret instruction
■ The instruction is decoded to determine what action is required

■ Fetch data
■ The execution of an instruction may require reading data from memory or an I/O
module

■ Process data
■ The execution of an instruction may require performing some arithmetic or logical
operation on data

■ Write data
■ The results of an execution may require writing data to memory or an I/O module

■ In order to do these things the processor needs to store some data


temporarily and therefore needs a small internal memory
+ The CPU and the System Bus
+ Internal Structure of the CPU
+
Register Organization
■ Within the processor there is a set of registers that
function as a level of memory above main memory and
cache in the hierarchy

■ The registers in the processor perform two roles:

■ User-Visible Registers
■ Enable the machine or assembly language programmer to minimize main memory references by optimizing use of registers

■ Control and Status Registers
■ Used by the control unit to control the operation of the processor and by privileged operating system programs to control the execution of programs
User-Visible Registers

■ Referenced by means of the machine language that the processor executes

Categories:
• General purpose
• Can be assigned to a variety of functions by the programmer
• Data
• May be used only to hold data and cannot be employed in the calculation of an operand address
• Address
• May be somewhat general purpose or may be devoted to a particular addressing mode
• Examples: segment pointers, index registers, stack pointer
• Condition codes
• Also referred to as flags
• Bits set by the processor hardware as the result of operations
Condition Codes
+
Control and Status Registers

Four registers are essential to instruction execution:

■ Program counter (PC)
■ Contains the address of an instruction to be fetched

■ Instruction register (IR)
■ Contains the instruction most recently fetched

■ Memory address register (MAR)
■ Contains the address of a location in memory

■ Memory buffer register (MBR)
■ Contains a word of data to be written to memory or the word most recently read
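The interplay of these four registers during a fetch can be sketched in a few lines of Python (a toy illustration, not any real machine; the memory contents and mnemonics are invented):

```python
# Minimal sketch of the fetch cycle data flow:
# MAR <- PC, memory read into MBR, MBR copied to IR, PC incremented.
memory = {0: "LOAD R1", 1: "ADD R1, R2", 2: "STORE R1"}  # toy program

pc, ir, mar, mbr = 0, None, None, None

def fetch():
    global pc, ir, mar, mbr
    mar = pc            # address of the next instruction goes to MAR
    mbr = memory[mar]   # memory read: the word arrives in MBR
    ir = mbr            # instruction moves into IR for decoding
    pc += 1             # PC now points at the following instruction

fetch()
print(pc, ir)   # 1 LOAD R1
```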
+ Program Status Word (PSW)

■ Register or set of registers that contain status information

■ Common fields or flags include:
• Sign
• Zero
• Carry
• Equal
• Overflow
• Interrupt Enable/Disable
• Supervisor
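How such flags are derived can be sketched for an 8-bit addition (a hedged illustration; `add_with_flags` is a hypothetical helper, and the Equal flag, which needs two comparands, is omitted):

```python
# Compute common PSW flags for an addition in a fixed word size.
def add_with_flags(a, b, bits=8):
    mask = (1 << bits) - 1
    full = a + b
    result = full & mask
    flags = {
        "carry": full > mask,                 # carry out of the top bit
        "zero": result == 0,
        "sign": bool(result >> (bits - 1)),   # MSB of the result
        # overflow: both operands differ in sign from the result
        "overflow": ((a ^ result) & (b ^ result) & (1 << (bits - 1))) != 0,
    }
    return result, flags

res, f = add_with_flags(0x7F, 0x01)  # 127 + 1 in signed 8-bit arithmetic
print(hex(res), f)                   # sign and overflow set, no carry
```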
+Example Microprocessor Register
Organization
Instruction Cycle

Includes the following stages:

■ Fetch: read the next instruction from memory into the processor
■ Execute: interpret the opcode and perform the indicated operation
■ Interrupt: if interrupts are enabled and an interrupt has occurred, save the current process state and service the interrupt
+ The Instruction
Cycle

• The main line of activity consists of alternating instruction fetch and instruction
execution activities.
• After an instruction is fetched, it is examined to determine if any indirect addressing is
involved. If so, the required operands are fetched using indirect addressing.
• Following execution, an interrupt may be processed before the next instruction fetch.
+
Instruction Cycle State Diagram
+
Data Flow, Fetch Cycle
+
Data Flow, Indirect Cycle
+
Data Flow, Interrupt Cycle
Pipelining Strategy

■ Similar to the use of an assembly line in a manufacturing plant

■ New inputs are accepted at one end before previously accepted inputs appear as outputs at the other end

■ To apply this concept to instruction execution we must recognize that an instruction has a number of stages
+
Traditional Pipeline Concept

■ Laundry Example
■ Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold (loads A, B, C, D)
■ Washer takes 30 minutes
■ Dryer takes 40 minutes
■ "Folder" takes 20 minutes

+
Traditional Pipeline Concept

[Figure: laundry timing charts from 6 PM to midnight for loads A, B, C, D, sequential vs. pipelined]

■ Sequential laundry takes 6 hours for 4 loads
■ If they learned pipelining, how long would laundry take?
■ Pipelined laundry takes 3.5 hours for 4 loads
■ Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
■ Pipeline rate is limited by the slowest pipeline stage
■ Multiple tasks operate simultaneously using different resources
■ Potential speedup = number of pipe stages
■ Unbalanced lengths of pipe stages reduce speedup
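The laundry arithmetic can be checked with a short calculation (a sketch using the stage times from the slides):

```python
# Sequential vs. pipelined completion time for the laundry example.
# Stage times: wash 30, dry 40, fold 20 (minutes), four loads.
stages = [30, 40, 20]
loads = 4

sequential = loads * sum(stages)        # 4 * 90 = 360 min = 6 hours

# With pipelining, the first load takes the full 90 minutes; each later
# load adds one slot of the slowest stage (the 40-minute dryer).
pipelined = sum(stages) + (loads - 1) * max(stages)  # 90 + 3*40 = 210 min

print(sequential / 60, pipelined / 60)  # 6.0 3.5
```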
+
Use the Idea of Pipelining in a Computer

Fetch + Execution

[Figure: basic idea of instruction pipelining. (a) Sequential execution: each instruction Ii completes its fetch Fi and execution Ei before the next instruction begins. (b) Hardware organization: an instruction fetch unit feeds an execution unit through interstage buffer B1. (c) Pipelined execution: while instruction I1 is in its execution stage E1, instruction I2 is fetched (F2), so fetch and execute overlap in successive clock cycles.]
+
Use the Idea of Pipelining in a Computer-
Two-Stage Instruction Pipeline
+
Fetch + Decode + Execution + Write + Additional Stages

■ Fetch instruction (FI)
■ Read the next expected instruction into a buffer

■ Decode instruction (DI)
■ Determine the opcode and the operand specifiers

■ Calculate operands (CO)
■ Calculate the effective address of each source operand; this may involve displacement, register indirect, indirect, or other forms of address calculation

■ Fetch operands (FO)
■ Fetch each operand from memory; operands in registers need not be fetched

■ Execute instruction (EI)
■ Perform the indicated operation and store the result, if any, in the specified destination operand location

■ Write operand (WO)
■ Store the result in memory
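The ideal timing of this six-stage pipeline can be sketched in Python (an illustration assuming no hazards, stalls, or memory conflicts):

```python
# Ideal six-stage pipeline timing: instruction i enters FI in cycle i+1
# and advances one stage per cycle, so n instructions need 6 + (n-1) cycles.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def timing(n):
    # rows[i] maps clock cycle -> stage occupied by instruction i+1
    return [{i + 1 + s: STAGES[s] for s in range(len(STAGES))}
            for i in range(n)]

rows = timing(4)
for i, row in enumerate(rows, start=1):
    print(f"I{i}" + "".join(f"{row.get(c, ''):>4}" for c in range(1, 10)))
print("total cycles:", max(rows[-1]))  # 9 for 4 instructions
```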
+
Timing Diagram for Instruction
Pipeline Operation
+
The Effect of a conditional
Branch on Instruction Pipeline
Operation
+
An alternative Pipeline
Depiction
+

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.


+Six-Stage CPU Instruction
Pipeline
Pipeline Hazards

■ A pipeline hazard occurs when the pipeline, or some portion of the pipeline, must stall because conditions do not permit continued execution

■ Such a pipeline stall is also referred to as a pipeline bubble

■ There are three types of hazards:
• Resource
• Data
• Control
+
Example of Resource Hazard
+
Example of Data Hazard

Clock cycle
1 2 3 4 5 6 7 8 9 10
ADD EAX, EBX FI DI FO EI WO
SUB ECX, EAX FI DI Idle FO EI WO
I3 FI DI FO EI WO
I4 FI DI FO EI WO

ADD EAX, EBX /* EAX = EAX + EBX */

SUB ECX, EAX /* ECX = ECX – EAX */


+ Types of Data Hazard

■ Read after write (RAW), or true dependency


■ An instruction modifies a register or memory location
■ Succeeding instruction reads data in memory or register
location
■ Hazard occurs if the read takes place before write operation is
complete

■ Write after read (WAR), or antidependency


■ An instruction reads a register or memory location
■ Succeeding instruction writes to the location

■ Hazard occurs if the write operation completes before the read


operation takes place

■ Write after write (WAW), or output dependency


■ Two instructions both write to the same location
■ Hazard occurs if the write operations take place in the reverse
order of the intended sequence
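The three cases can be sketched as a small classifier over the register sets each instruction reads and writes (an illustration; `classify` is a hypothetical helper, not part of any library):

```python
# Classify the data hazards between two instructions in program order,
# given the registers each one reads and writes.
def classify(first_reads, first_writes, second_reads, second_writes):
    hazards = []
    if first_writes & second_reads:
        hazards.append("RAW (true dependency)")     # write then read
    if first_reads & second_writes:
        hazards.append("WAR (antidependency)")      # read then write
    if first_writes & second_writes:
        hazards.append("WAW (output dependency)")   # write then write
    return hazards

# The slide's example: ADD EAX, EBX followed by SUB ECX, EAX
print(classify({"EAX", "EBX"}, {"EAX"}, {"ECX", "EAX"}, {"ECX"}))
# ['RAW (true dependency)']
```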
+
Control Hazard

■Also known as a branch hazard

■Occurs when the pipeline makes the wrong decision on


a branch prediction

■Brings instructions into the pipeline that must


subsequently be discarded

■Dealing with Branches:


■ Multiple streams
■ Prefetch branch target
■ Loop buffer
■ Branch prediction
■ Delayed branch
Multiple Streams
A simple pipeline suffers a penalty for a branch
instruction because it must choose one of two
instructions to fetch next and may make the
wrong choice

A brute-force approach is to replicate the initial


portions of the pipeline and allow the pipeline to
fetch both instructions, making use of two
streams

Drawbacks:
• With multiple pipelines there are contention delays for access
to the registers and to memory
• Additional branch instructions may enter the pipeline before
the original branch decision is resolved
Prefetch Branch Target

■When a conditional branch is recognized,


the target of the branch is prefetched, in
addition to the instruction following the
branch

■Target is then saved until the branch


instruction is executed

■If the branch is taken, the target has


already been prefetched

+ ■IBM 360/91 uses this approach


+
Loop Buffer
■Small, very-high speed memory maintained by the
instruction fetch stage of the pipeline and containing the
n most recently fetched instructions, in sequence

■Benefits:
■ Instructions fetched in sequence will be available without the
usual memory access time
■ If a branch occurs to a target just a few locations ahead of the
address of the branch instruction, the target will already be in
the buffer
■ This strategy is particularly well suited to dealing with loops

■Similar in principle to a cache dedicated to instructions


■ Differences:
■ The loop buffer only retains instructions in sequence
■ Is much smaller in size and hence lower in cost
[Figure: loop buffer (256 bytes). The branch address is split: the most significant address bits are compared against the buffer's tag to determine a hit, and on a hit the low 8 bits select the instruction to be decoded from the buffer.]
+ Branch Prediction

■ Various techniques can be used to predict whether a branch will be taken:

1. Predict never taken
2. Predict always taken
3. Predict by opcode
■ These approaches are static: they do not depend on the execution history up to the time of the conditional branch instruction

4. Taken/not taken switch
5. Branch history table
■ These approaches are dynamic: they depend on the execution history
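The "taken/not taken switch" is commonly realized as a 2-bit saturating counter per branch, predicting taken when the counter is 2 or 3; a sketch (the initial state and branch pattern here are illustrative):

```python
# 2-bit saturating-counter branch predictor: two wrong guesses in a row
# are needed to flip the prediction, which tolerates single loop exits.
class TwoBitPredictor:
    def __init__(self):
        self.counter = 2  # start in "weakly taken"

    def predict(self):
        return self.counter >= 2          # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
outcomes = [True, True, True, False, True]  # a loop branch with one exit
correct = sum(p.predict() == t or p.update(t) for t in outcomes) if False else 0
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, "of", len(outcomes))  # 4 of 5: one miss on the loop exit
```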
+
Branch Prediction Flow Chart
+
Branch Prediction State
Diagram
+ Dealing with branches
Intel 80486 Pipelining

■ Fetch
■ Objective is to fill the prefetch buffers with new data as soon as the old data have been consumed by the instruction decoder
■ Operates independently of the other stages to keep the prefetch buffers full

■ Decode stage 1
■ All opcode and addressing-mode information is decoded in the D1 stage
■ 3 bytes of instruction are passed to the D1 stage from the prefetch buffers
■ The D1 decoder can then direct the D2 stage to capture the rest of the instruction

■ Decode stage 2
■ Expands each opcode into control signals for the ALU
■ Also controls the computation of the more complex addressing modes

■ Execute
■ Stage includes ALU operations, cache access, and register update

■ Write back
■ Updates registers and status flags modified during the preceding execute stage
+ 80486 Instruction Pipeline
Examples
Instruction Level
Parallelism
and Superscalar
Processors
+
Scalar Processor

■A scalar processor processes only one data item at


a time, with typical data items being integers or
floating point numbers.

■A scalar processor is classified as a single instruction,


single data (SISD) processor in Flynn's taxonomy.

■The Intel 486 is an example of a scalar processor.


+
Vector Processor

■A vector processor acts on several pieces of data with a


single instruction (SIMD).

■Vector processors were popular for supercomputers in


the 1980s and 1990s because they efficiently handled
the long vectors of data common in scientific
computations, and they are heavily used now
in graphics processing units (GPUs).
+ What is Superscalar?

■A superscalar processor contains multiple copies of


the datapath hardware to execute multiple instructions
simultaneously.

■Common instructions (arithmetic, load/store, conditional


branch) can be initiated simultaneously and executed
independently

■Applicable to both RISC & CISC


+ Why Superscalar?

■Most operations are on scalar quantities

■Improve these operations by executing them concurrently in


multiple pipelines

■Requires multiple functional units

■Requires re-arrangement of instructions


+
Scalar Organization
+
Superscalar Organization
+ Superpipelined

■Many pipeline stages need less than half a clock cycle

■Double internal clock speed gets two tasks per external


clock cycle

■Superscalar allows parallel fetch and execute


+
Comparison of Superscalar and
Superpipeline Approaches
+ Limitations
■ Instruction-level parallelism: the degree to which the instructions of a program can, in theory, be executed in parallel

■To achieve it:


■ Compiler based optimisation
■ Hardware techniques

■Limited by
■ Data dependency
■ Procedural dependency
■ Resource conflicts
+ True Data (Write-Read) Dependency
■ ADD r1, r2 (r1 🡨 r1 + r2)
■ MOV r3, r1 (r3 🡨 r1)
■Can fetch and decode second instruction
in parallel with first but cannot execute
second instruction until first is finished
■Also called Flow dependency or RAW
dependency
+ Procedural Dependency
■ Cannot execute instructions after a (conditional) branch in parallel with instructions before the branch
■ Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed (a CISC trait; RISC instructions have a fixed length)
■ This prevents simultaneous fetches
+Resource Conflict
■Two or more instructions requiring access to the same
resource at the same time
■ e.g. functional units, registers, bus

■Similar to true data dependency, but it is possible to


duplicate resources
+ Design Issues
■Instruction level parallelism
■ Some instructions in a sequence are independent
■ Execution can be overlapped or re-ordered
■ Governed by data and procedural dependency

■Machine Parallelism
■ Ability to take advantage of instruction level parallelism
■ Governed by number of parallel pipelines
+
Example of the concept of instruction-level parallelism

■ As an example of the concept of instruction-level parallelism, consider the following two code fragments:

Fragment 1:
Load R1 🡨 R2
Add R3 🡨 R3, "1"
Add R4 🡨 R4, R2

Fragment 2:
Add R3 🡨 R3, "1"
Add R4 🡨 R3, R2
Store [R4] 🡨 R0

■ The three instructions in Fragment 1 are independent, so in principle all three could be executed in parallel. The instructions in Fragment 2 cannot be executed in parallel: the second needs the result of the first, and the third needs the result of the second.

+
Instruction-Level Parallelism (contd..)
■ The degree of instruction-level parallelism is determined by the frequency of true data dependencies and procedural dependencies in the code.
■ These factors, in turn, are dependent on the instruction set architecture and on the application.
■ Instruction-level parallelism is also determined by operation latency: the time until the result of an instruction is available for use as an operand in a subsequent instruction.
■ The latency determines how much of a delay a data or procedural dependency will cause.
+
Instruction Issue Policy

■ Three types of orderings are important in this regard:
■ The order in which instructions are fetched
■ The order in which instructions are executed
■ The order in which instructions update the contents of register and memory locations

■ In general terms, we can group superscalar instruction issue policies into the following categories:
■ In-order issue with in-order completion
■ In-order issue with out-of-order completion
■ Out-of-order issue with out-of-order completion
+ (Re-)ordering instructions

■Order in which instructions are fetched

■Order in which instructions are executed – instruction issue

■Order in which instructions change registers and memory -


commitment or retiring
+ In-Order Issue
In-Order Completion
■Issue instructions in the order they occur

■Not very efficient – not used in practice

■May fetch >1 instruction

■Instructions must stall if necessary


+ An Example

■I1 requires two cycles to execute

■I3 and I4 compete for the same execution unit

■I5 depends on the value produced by I4

■I5 and I6 compete for the same execution unit

Two fetch and write units, three execution units


+ In-Order Issue In-Order Completion
(Diagram)
+ In-Order Issue Out-of-Order
Completion (Diagram)
+ In-Order Issue
Out-of-Order Completion
■Output (write-write) dependency
■ R3 🡨 R3 + R5 (I1)
■ R4 🡨 R3 + 1 (I2)
■ R3 🡨 R5 + 1 (I3)
■ R6 🡨 R3 + 1 (I4)
■ I2 depends on result of I1 - data dependency
■ If I3 completes before I1, the input for I4 will be wrong - output
dependency
+ Out-of-Order Issue
Out-of-Order Completion
■Decouple decode pipeline from execution pipeline

■Can continue to fetch and decode until this pipeline is full

■When an execution unit becomes available an instruction can be executed

■Since instructions have been decoded, processor can look


ahead – instruction window
+ Out-of-Order Issue Out-of-Order
Completion (Diagram)
+
Antidependency (WAR)

■ Read-Write dependency: I2-I3

■ R3 🡨 R3 + R5 (I1)
■ R4 🡨 R3 + 1 (I2)
■ R3 🡨 R5 + 1 (I3)
■ R7 🡨 R3 + R4 (I4)
■ I3 must not complete before I2 reads its operand, as I2 needs the old value in R3 and I3 changes R3
+ Register Renaming

■Output and antidependencies occur because register


contents may not reflect the correct program flow

■May result in a pipeline stall

■The usual reason is storage conflict

■Registers can be allocated dynamically


+ Register Renaming example
■R3b 🡨 R3a + R5a (I1)

■R4b 🡨 R3b + 1 (I2)

■R3c 🡨 R5a + 1 (I3)

■R7b 🡨 R3c + R4b (I4)

■Without label (a,b,c) refers to logical register

■With label is hardware register allocated

■Removes antidependency I2-I3 and output dependency


I1&I3-I4

■Needs extra registers
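The renaming shown above can be sketched as a small allocator that hands out a fresh physical register for every write (an illustration; a real processor recycles a finite physical register file):

```python
# Register renaming: each write gets a fresh physical register, which
# removes WAR and WAW hazards while preserving true (RAW) dependencies.
def rename(instructions):
    mapping = {}        # logical register -> current physical register
    next_phys = 0
    renamed = []
    for dest, sources in instructions:
        # sources read the current mapping, preserving true dependencies
        srcs = tuple(mapping.get(s, s) for s in sources)
        mapping[dest] = f"P{next_phys}"   # fresh register for the write
        next_phys += 1
        renamed.append((mapping[dest], srcs))
    return renamed

# The slide's sequence: I1 R3<-R3+R5, I2 R4<-R3+1, I3 R3<-R5+1, I4 R7<-R3+R4
prog = [("R3", ("R3", "R5")), ("R4", ("R3",)),
        ("R3", ("R5",)), ("R7", ("R3", "R4"))]
for dest, srcs in rename(prog):
    print(dest, "<-", srcs)   # I3's write lands in P2, so I1/I3 no longer clash
```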


+ Machine Parallelism

■ The three techniques for performance enhancement in superscalar processors are:
■ Duplication of resources
■ Out-of-order issue
■ Register renaming

■ Not worth duplicating functional units without register renaming

■ Need an instruction window large enough (more than 8 entries)

+
Speedups of Various Machine
Organizations without
Procedural Dependencies


+ Branch Prediction

■Intel 80486 fetches both next sequential instruction after


branch and branch target instruction

■Gives a two-cycle delay if the branch is taken (two decode cycles)

+ RISC - Delayed Branch
■Calculate the result of the branch before unusable instructions are prefetched

■Always execute the single instruction immediately following the branch

■Keeps the pipeline full while fetching a new instruction stream

■Not as good for superscalar
■ Multiple instructions would need to execute in the delay slot

■Revert to branch prediction

+ Superscalar Execution
