COA UNIT - III Processor and Control Unit
UNIT III
TEXTBOOKS & REFERENCES
Text Books
Carl Hamacher, Zvonko G. Vranesic, Safwat G. Zaky, "Computer Organization", 5th edition, McGraw-Hill, 2014.
David A. Patterson and John L. Hennessy, "Computer Organisation and Design", 5th edition, Morgan Kaufmann / Elsevier, 2014.
Morris Mano, "Computer System Architecture", 3rd edition, Prentice Hall of India, 2008.
Reference Books
Morris Mano, "Computer System Architecture", 3rd edition, Prentice Hall of India, 2008.
John P. Hayes, "Computer Architecture and Organisation", McGraw-Hill, 2012.
William Stallings, "Computer Organization and Architecture", 7th edition, Prentice-Hall of India Pvt. Ltd., 2016.
Vincent P. Heuring, Harry F. Jordan, "Computer System Architecture", 2nd edition, Pearson Education, 2005.
Govindarajalu, "Computer Architecture and Organization: Design Principles and Applications", 1st edition, Tata McGraw-Hill, New Delhi, 2005.
Web References
http://www.inetdaemon.com/tutorials/computers/hardware/cpu/
https://inst.eecs.berkeley.edu/~cs152/sp18/
http://users.ece.cmu.edu/~jhoe/doku/doku.php?id=18-447_introduction_to_computer_architecture
UNIT III
PROCESSOR AND CONTROL UNIT
Fundamental concepts
Execution of a complete instruction
Multiple bus organization
Hardwired control
Microprogrammed control
Pipelining: Basic concepts
Data hazards
Instruction hazards
Influence on Instruction sets
Data path and control consideration
Parallel Processors.
Overview
Central Processing Unit (CPU): Fetching, Decoding, Executing instructions
A typical computing task consists of a series of steps
specified by a sequence of machine instructions that
constitute a program.
An instruction is executed by carrying out a sequence of more elementary operations.
Some Fundamental Concepts
Fundamental Concepts
The processor fetches one instruction at a time and performs the operation specified.
Instructions are fetched from successive memory
locations until a branch or a jump instruction is
encountered.
The processor keeps track of the address of the memory location containing the next instruction to be fetched using the Program Counter (PC).
The instruction itself is held in the Instruction Register (IR) while it is being executed.
Executing an Instruction
Fetch the contents of the memory location pointed
to by the PC. The contents of this location are
loaded into the IR (fetch phase).
IR ← [[PC]]
Assuming that the memory is byte addressable and each instruction occupies one 4-byte word, increment the contents of the PC by 4 (fetch phase).
PC ← [PC] + 4
Carry out the actions specified by the instruction in
the IR (execution phase).
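The fetch and execution phases above can be sketched as a tiny simulation. The word size matches the PC increment described earlier, but the instruction encoding and memory contents are illustrative assumptions, not part of the original text:

```python
# Minimal sketch of the fetch/execute cycle.
# The tuple encoding and the tiny "memory" contents are assumptions
# made for illustration.

WORD = 4  # byte-addressable memory, 4-byte instruction words

memory = {
    0: ("ADD", "R1", "R2"),   # hypothetical encoded instructions
    4: ("SUB", "R1", "R3"),
}
regs = {"PC": 0, "IR": None, "R1": 5, "R2": 7, "R3": 2}

def step():
    # Fetch phase: IR <- [[PC]], then PC <- [PC] + 4
    regs["IR"] = memory[regs["PC"]]
    regs["PC"] += WORD
    # Execution phase: carry out the action specified in the IR
    op, dst, src = regs["IR"]
    if op == "ADD":
        regs[dst] += regs[src]
    elif op == "SUB":
        regs[dst] -= regs[src]

step()  # executes ADD R1,R2 -> R1 = 12
step()  # executes SUB R1,R3 -> R1 = 10
```

Each call to `step` performs one complete fetch/execute cycle, mirroring the two phases listed above.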
Executing an Instruction
Transfer a word of data from one processor
register to another or to the ALU.
Perform an arithmetic or a logic operation
and store the result in a processor register.
Fetch the contents of a given memory
location and load them into a processor
register.
Store a word of data from a processor
register into a given memory location.
Register Transfers
Figure 7.2. Input and output gating for the registers in Figure 7.1: each register Ri is connected to the internal processor bus through gating signals Riin and Riout. Register Y feeds the ALU's A input through a multiplexer (MUX) whose Select signal chooses either Y or the constant 4; the B input comes directly from the bus, and the ALU result is gated into register Z via Zin and back onto the bus via Zout.
Register Transfers
All operations and data transfers are controlled by the processor clock.
Figure 7.3. Input and output gating for one register bit: an edge-triggered D flip-flop holds the bit; signal Riin selects whether the flip-flop reloads from the bus or retains its value on the clock edge, and Riout enables its output onto the bus.
Performing an Arithmetic or Logic Operation
The ALU is a combinational circuit that has no internal storage. It performs arithmetic and logic operations on the two operands applied to its A and B inputs: one operand is the output of the multiplexer MUX, and the other is obtained directly from the bus.
The result produced by the ALU is stored temporarily in register Z. Therefore, a sequence of operations to add the contents of register R1 to those of register R2 and store the result in register R3 is:
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
Fetching a Word from Memory
To fetch a word of information from memory, the processor has to specify the address of the memory location where this information is stored and request a Read operation.
The processor transfers the required address to the MAR, whose output is connected to the address lines of the memory bus.
At the same time, the processor uses the control lines of the memory bus to indicate that a Read operation is needed. When the requested data are received from the memory, they are stored in register MDR, from where they can be transferred to other registers in the processor.
Fetching a Word from Memory
Register MDR has four control signals: MDRin and MDRout control the connection to the internal processor bus, while MDRinE and MDRoutE control the connection to the external memory bus.
Figure 7.4. Connection and control signals for register MDR.
Add (R3),R1
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     R3out, MARin, Read
5     R1out, Yin, WMFC
6     MDRout, SelectY, Add, Zin
7     Zout, R1in, End
Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
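The control sequence can be traced step by step in a short register-transfer sketch. The concrete addresses and operand values are assumptions for illustration, and WMFC (wait for memory function to complete) is modeled as an immediate dictionary lookup:

```python
# Sketch of the 7-step control sequence for Add (R3),R1 at the
# register-transfer level. Addresses/values are illustrative.

memory = {100: "Add (R3),R1", 200: 30}   # instruction word and operand
regs = {"PC": 100, "R1": 12, "R3": 200,
        "MAR": None, "MDR": None, "IR": None, "Y": None, "Z": None}

# Step 1: PCout, MARin, Read, Select4, Add, Zin
regs["MAR"] = regs["PC"]; regs["Z"] = regs["PC"] + 4
# Step 2: Zout, PCin, Yin, WMFC
regs["PC"] = regs["Z"]; regs["Y"] = regs["Z"]
regs["MDR"] = memory[regs["MAR"]]
# Step 3: MDRout, IRin
regs["IR"] = regs["MDR"]
# Step 4: R3out, MARin, Read
regs["MAR"] = regs["R3"]
# Step 5: R1out, Yin, WMFC
regs["Y"] = regs["R1"]; regs["MDR"] = memory[regs["MAR"]]
# Step 6: MDRout, SelectY, Add, Zin
regs["Z"] = regs["Y"] + regs["MDR"]
# Step 7: Zout, R1in, End
regs["R1"] = regs["Z"]

print(regs["R1"])  # 12 + 30 = 42
```

Note how steps 1-3 are the common fetch phase (they also advance the PC), while steps 4-7 are specific to this instruction.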
Branch Instructions
A branch instruction replaces the contents of
PC with the branch target address, which is
usually obtained by adding an offset X given
in the branch instruction.
The offset X is usually the difference between
the branch target address and the address
immediately following the branch instruction.
Conditional branch
Branch Instructions
The control sequence in Figure 7.7 implements an unconditional branch instruction. Processing starts, as usual, with the fetch phase.
This phase ends when the instruction is
loaded into the IR in step 3. The offset value
is extracted from the IR by the instruction
decoding circuit, which will also perform sign
extension if required.
Execution of Branch Instructions
Since the value of the updated PC is already available in register Y, the offset X is gated onto the bus in step 4, and an addition operation is performed. The result, which is the branch target address, is loaded into the PC in step 5.
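The relationship between the offset X, the updated PC, and the branch target can be checked with a short calculation. The addresses below are illustrative assumptions (4-byte instructions, as before):

```python
# The offset X is the difference between the branch target and the
# address immediately following the branch instruction (the updated PC).
# Addresses here are illustrative assumptions.

branch_addr = 2000            # address of the branch instruction
updated_pc = branch_addr + 4  # PC after the fetch phase (register Y)
target = 1900                 # desired branch target

offset_x = target - updated_pc          # value encoded in the instruction
assert updated_pc + offset_x == target  # steps 4-5: PC <- [Y] + X
print(offset_x)  # -104 (a backward branch)
```

A negative offset, as here, produces a backward branch; a forward branch would simply use a positive X.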
Execution of Branch Instructions
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     Offset-field-of-IRout, Add, Zin
5     Zout, PCin, End
Figure 7.7. Control sequence for an unconditional branch instruction.
Multiple-Bus Organization
In a three-bus organization, buses A and B carry the source operands to the ALU inputs, and bus C carries the result to the destination register, so a three-operand instruction can complete in fewer steps. Control sequence for the instruction Add R4,R5,R6:
Step  Action
1     PCout, R=B, MARin, Read, IncPC
2     WMFC
3     MDRoutB, R=B, IRin
4     R4outA, R5outB, SelectA, Add, R6in, End
Hardwired Control
Figure: Control unit organization — the instruction register (IR) feeds a decoder/encoder block, which combines the decoded instruction with external inputs and condition codes to generate the control signals.
Microprogrammed Control
Control word: a word whose individual bits represent the various control signals (control variables).
Control memory: a special memory in which the control words are stored.
Microinstruction: one control word, specifying the control signals for a single step.
Microprogram: a sequence of microinstructions corresponding to the control sequence of a machine instruction.
Laundry Example
Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
The washer takes 30 minutes, the dryer takes 40 minutes, and folding takes 20 minutes.
Sequential laundry takes 6 hours for 4 loads: each load occupies 30 + 40 + 20 = 90 minutes, one load after another.
If they learned pipelining, how long would laundry take?
Traditional Pipeline Concept
Pipelined laundry takes 3.5 hours for 4 loads: as soon as the first load moves from the washer to the dryer, the second load can start washing, and so on; a new load completes every 40 minutes, the time of the slowest stage.
Traditional Pipeline Concept
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
The pipeline rate is limited by the slowest pipeline stage.
Multiple tasks operate simultaneously using different resources.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce the speedup.
Time to "fill" the pipeline and time to "drain" it reduce the speedup.
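The timing in the laundry example can be checked with a short sketch; the stage durations are the ones given above, and the arithmetic is the only thing added:

```python
# Timing sketch for the laundry example: wash 30 min, dry 40 min,
# fold 20 min, four loads. With pipelining, a new load can start
# as soon as the slowest stage (the 40-minute dryer) frees up.

loads = 4
wash, dry, fold = 30, 40, 20

sequential = loads * (wash + dry + fold)          # 4 * 90 = 360 min

# First load passes through all stages; each later load finishes one
# slowest-stage time after the previous load.
pipelined = (wash + dry + fold) + (loads - 1) * max(wash, dry, fold)

print(sequential / 60)  # 6.0 hours
print(pipelined / 60)   # 3.5 hours
```

The formula also shows why unbalanced stages hurt: the `max(...)` term, not the average stage time, sets the completion rate.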
Stall for Dependences
Use the Idea of Pipelining in a Computer
Fetch + Execution:
(a) Sequential execution: instruction I1 is fetched (F1) and then executed (E1), after which I2 is fetched (F2) and executed (E2), and so on; each instruction occupies two clock cycles.
(b) Hardware organization: an interstage buffer B1 between the instruction fetch unit and the execution unit holds the fetched instruction.
(c) Pipelined execution: while the execution unit performs E1, the fetch unit fetches I2 (F2); once the pipeline is full, an instruction completes every clock cycle.
A four-stage pipeline extends this idea: Fetch + Decode + Execute + Write.
Role of Cache Memory
Each pipeline stage is expected to complete in one clock cycle.
The clock period should be long enough to let the slowest pipeline stage complete.
Faster stages have to wait for the slowest one to complete.
Since main memory is very slow compared to instruction execution, if each instruction had to be fetched from main memory, the pipeline would be almost useless.
Fortunately, we have cache: fetches that hit in the cache complete in about one clock cycle.
Pipeline Performance
The potential increase in performance
resulting from pipelining is proportional to the
number of pipeline stages.
However, this increase would be achieved
only if all pipeline stages require the same
time to complete, and there is no interruption
throughout program execution.
Unfortunately, this is not true.
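The proportionality claim can be made concrete with a small calculation, using a hypothetical 4-stage pipeline, 100 instructions, and one clock per stage:

```python
# Ideal pipeline speedup sketch: k instructions through an n-stage
# pipeline, one clock cycle per stage, no stalls. Numbers illustrative.

n_stages = 4
k = 100

cycles_sequential = k * n_stages        # 400: each instruction runs alone
cycles_pipelined = n_stages + (k - 1)   # 103: fill time + one per instr.

speedup = cycles_sequential / cycles_pipelined
print(round(speedup, 2))  # about 3.88, approaching n_stages for large k
```

The speedup only approaches the stage count asymptotically, and any stall cycle added to `cycles_pipelined` pulls it further below the ideal.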
Pipeline Performance
The previous pipeline is said to have been stalled for two clock cycles.
Any condition that causes a pipeline to stall is called a hazard.
Data hazard – any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline, so some operation has to be delayed and the pipeline stalls.
Instruction (control) hazard – a delay in the availability of an instruction causes the pipeline to stall.
Structural hazard – the situation when two instructions require the use of a given hardware resource at the same time.
Pipeline Performance
Again, pipelining does not result in individual instructions being executed faster; rather, it is the throughput that increases.
Throughput is measured by the rate at which instruction execution is completed.
Pipeline stalls cause degradation in pipeline performance.
We need to identify all hazards that may cause the pipeline to stall and to find ways to minimize their impact.
Data Hazards
Hazards
Hazards are problems with the instruction pipeline in CPU microarchitectures that arise when the next instruction cannot execute in the following clock cycle; they can potentially lead to incorrect computation results.
Three common types of hazards are data hazards,
structural hazards, and control hazards (branching
hazards).
Data Hazards
Data hazards occur when an instruction depends on the result of a previous instruction and that result has not yet been computed. Whenever two different instructions use the same storage location, that location must appear to be accessed as if the instructions executed in sequential order.
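As a sketch of the timing behind such a hazard, consider the 4-stage F/D/E/W pipeline used earlier; the instruction pair and cycle numbers below are illustrative assumptions:

```python
# Read-after-write (RAW) data hazard sketch:
#   I1: Add R1, R2, R3   (writes R1 in its W stage)
#   I2: Sub R4, R1, R5   (reads R1 in its D stage)
# With F=1, D=2, E=3, W=4 for I1 and stages shifted by one for I2,
# I2 would read R1 in cycle 3, but I1 only writes it in cycle 4.

i1_write_cycle = 4    # W stage of I1
i2_decode_cycle = 3   # D stage of I2 if it did not stall

# I2 must delay its decode until the cycle after I1's write-back.
stall_cycles = max(0, i1_write_cycle + 1 - i2_decode_cycle)
print(stall_cycles)  # 2 idle cycles injected into the pipeline
```

The result matches the two-cycle stall mentioned under Pipeline Performance; operand forwarding can shrink or remove this delay by routing the ALU result to I2 before write-back.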
Figure 8.8. An idle cycle caused by a branch instruction: the instruction fetched after the branch (F3) is discarded (X), and execution continues with the branch target instruction Ik (Fk, Ek).
Unconditional Branches
Unconditional branch instructions such as
GOTO are used to unconditionally
"jump" to (begin execution of) a different
instruction sequence.
In programming, a GOTO, BRANCH or
JUMP instruction that passes control to a
different part of the program.
Instruction Queue and Prefetching
The instruction fetch unit fetches instructions ahead of time and places them in an instruction queue; a dispatch/decode unit takes instructions from the queue and sends them on through the remaining stages (F: Fetch instruction, D: Dispatch/Decode instruction, E: Execute, W: Write results).
Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b.
Conditional Branches
A conditional branch instruction introduces
the added hazard caused by the dependency
of the branch condition on the result of a
preceding instruction.
The decision to branch cannot be made until
the execution of that instruction has been
completed.
Branch instructions represent about 20% of
the dynamic instruction count of most
programs.
Conditional Branches
A conditional branch is a programming instruction that directs the computer to another part of the program based on the result of a comparison.
A conditional branch is taken based on some condition, like an if condition in C; transfer of control will depend on the outcome of this condition.
Delayed Branch
The instructions in the delay slots are always fetched. Therefore, we would like to arrange for them to be fully executed whether or not the branch is taken.
The objective is to place useful instructions in these slots.
The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions.
Delayed Branch
(a) Original program loop:
LOOP    Shift_left  R1
        Decrement   R2
        Branch=0    LOOP
NEXT    Add         R1,R3
(b) Reordered instructions (Shift_left fills the delay slot):
LOOP    Decrement   R2
        Branch=0    LOOP
        Shift_left  R1
NEXT    Add         R1,R3
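A rough cycle count for the two loop versions above, assuming one cycle per instruction, a single delay slot, and an assumed initial value of R2:

```python
# Cycle-count sketch for the delayed-branch loop: the branch has one
# delay slot, and Shift_left R1 has been moved into it so the slot
# always does useful work. One cycle per instruction is assumed.

iterations = 4  # assumed initial value of R2

# Reordered body: Decrement, Branch, Shift_left (in the delay slot)
cycles_filled_slot = iterations * 3

# If the slot could not be filled, a NOP would occupy it each time:
# Shift_left, Decrement, Branch, NOP
cycles_with_nop = iterations * 4

print(cycles_filled_slot, cycles_with_nop)  # 12 16
```

The gap between the two counts grows with the iteration count, which is why compilers try hard to find an independent instruction for the slot.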
Superscalar
The maximum throughput of a pipelined processor is
one instruction per clock cycle.
If we equip the processor with multiple processing
units to handle several instructions in parallel in each
processing stage, several instructions start
execution in the same clock cycle – multiple-issue.
Processors are capable of achieving an instruction
execution throughput of more than one instruction per
cycle – superscalar processors.
Multiple-issue requires a wider path to the cache and
multiple execution units.
Figure 8.19. A processor with two execution units.
Superscalar
Advantages of Superscalar Architecture :
The compiler can avoid many hazards through
judicious selection and ordering of instructions.
The compiler should strive to interleave floating
point and integer instructions. This would enable
the dispatch unit to keep both the integer and
floating point units busy most of the time.
In general, high performance is achieved if the
compiler is able to arrange program instructions
to take maximum advantage of the available
hardware units.
Superscalar
Disadvantages of Superscalar Architecture :
In a superscalar processor, the detrimental effect of the various hazards on performance becomes even more pronounced.
With this type of architecture, instruction-scheduling problems can occur.
Parallel processing
Parallel processing is a method in computing of running two or more processors (CPUs) to handle separate parts of an overall task.
Any system that has more than one CPU can perform parallel processing, as can the multi-core processors commonly found in computers today.
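As an illustrative sketch (not from the text), the idea of handling separate parts of an overall task can be shown with Python's standard multiprocessing pool, where each worker process sums one half of a list:

```python
# Minimal parallel-processing sketch: two worker processes each handle
# a separate part of an overall task (summing one half of a list).
# The pool size and data are illustrative assumptions.

from multiprocessing import Pool

def partial_sum(chunk):
    # work done independently by each worker
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1000))
    halves = [data[:500], data[500:]]
    with Pool(processes=2) as pool:       # one worker per part
        parts = pool.map(partial_sum, halves)
    print(sum(parts))  # same answer as the sequential sum(data)
```

Combining the partial results at the end mirrors how multiprocessor systems must merge the outputs of independently executing parts.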
Vector processing
Vector processing is the processing of sequences of data in a uniform manner, a common occurrence in the manipulation of matrices (whose elements are vectors) or other arrays of data. A vector processor processes sequences of input data as a result of obeying a single vector instruction and generates a result data sequence.
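A minimal sketch of the idea in plain Python: one "vector add" applies the same operation uniformly across whole sequences, standing in for a single hardware vector instruction (the `vector_add` helper is purely illustrative):

```python
# Vector-processing sketch: one conceptual "vector instruction"
# operates on whole sequences of data in a uniform manner, instead
# of a scalar loop handling one element per instruction.

def vector_add(a, b):
    # stands in for a single vector add instruction over two sequences
    return [x + y for x, y in zip(a, b)]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(vector_add(a, b))  # [11.0, 22.0, 33.0, 44.0]
```

On a real vector processor the element-wise additions would be issued by one instruction and executed by replicated or pipelined functional units, rather than by an interpreted loop.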