
CS/CoE 1541: Single and Multi-Cycle Implementations

The document describes the progression from a single-cycle to a multi-cycle to a pipelined processor implementation. A single-cycle design executes every instruction in one clock cycle, so the clock period is set by the slowest instruction. A multi-cycle design breaks each instruction into several shorter clock cycles, so different instruction types take different numbers of cycles, but it still completes one instruction before starting the next. A pipelined design overlaps instructions by partitioning the datapath into stages so that several instructions are in different stages at once, improving throughput.

Uploaded by

Bobo Joo

CS/CoE 1541

Single and Multi-cycle Implementations

Introduction to Computer Architecture

Single Cycle Recap

Typical Instruction Execution
1) Fetch instruction from memory
2) Decode instruction
3) If necessary, perform an ALU operation
4) If memory access, perform load/store
5) Write results back to register file and increment the PC

Fetching Instruction

Memory to hold instructions
Program Counter (PC) to generate address of instruction
Adder to increment the PC for the next instruction's address

[Diagram: the Program Counter register supplies the current PC as the instruction address to the Instruction Memory (RAM), which outputs the instruction; an adder takes the current PC and produces the next PC.]

Instruction Fetch Unit


[Diagram: instruction fetch unit. An adder adds 4 to the current PC to form the next PC, which is written into the Program Counter register; the current PC is the instruction address for the Instruction Memory (RAM), which outputs the instruction.]

ALU Operation
Consider a basic ALU operation
    add R1,R2,R3
Requires a Register File and an ALU

[Diagram: register numbers drive Read Reg 1, Read Reg 2, and Write Reg of the Register File; Read Data 1 and Read Data 2 feed the ALU, whose function is chosen by the ALU Operation input; the ALU result returns to Write Data.]

ADD R2, R3, R4


op      rs     rt     rd     shamt  funct
000000  00011  00100  00010  00000  100000

[Diagram: the register fields of the instruction drive Read Reg 1, Read Reg 2, and Write Reg of the Register File; Read Data 1 and Read Data 2 feed the ALU, whose function is chosen by the ALU Operation input; the ALU result returns to Write Data, gated by Write Enable.]
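
The field layout above can be checked with a few shifts. This is a minimal sketch, not part of the original slides: the helper name `encode_rtype` and its argument order are assumptions made for illustration, while the field widths and the bit pattern for ADD R2, R3, R4 come from the slide.

```python
def encode_rtype(rs, rt, rd, shamt=0, funct=0x20, op=0):
    """Pack MIPS R-type fields into a 32-bit word (hypothetical helper).

    Field widths follow the slide: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6).
    """
    fields = (("op", op, 6), ("rs", rs, 5), ("rt", rt, 5),
              ("rd", rd, 5), ("shamt", shamt, 5), ("funct", funct, 6))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# ADD R2, R3, R4: rd = 2, rs = 3, rt = 4, funct = 100000 (add)
word = encode_rtype(rs=3, rt=4, rd=2)
print(f"{word:032b}")
```

Note that the assembly order (rd, rs, rt) differs from the order of the fields in the encoded word (rs, rt, rd).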

Accessing Memory (loads and stores)


LW R4,0x10(R2)

[Diagram: the blocks needed for loads and stores: the instruction, the Register File, the ALU, and the Data Memory (RAM).]

Load: From Memory to Register File


LW R4,0x10(R2)

[Diagram: the Read data output of the Data Memory (RAM) is routed to the Write Data input of the Register File, with Write Enable asserted.]

Computing the Address (part 1)


LW R4,0x10(R2)

[Diagram: the ALU output drives the Read/Write address input of the Data Memory (RAM); the memory's Read data returns to the Register File.]

Computing the Address (part 2)


LW R4,0x10(R2)

[Diagram: the ALU is set to Add. Its inputs are Read Data 1 (the base register) and the 16-bit immediate from the instruction, sign-extended to 32 bits; the sum is the Read/Write address of the Data Memory (RAM).]

Sign Extender
op      rs     rt     immediate
000000  00011  00100  00010000 00100000

[Diagram: the 16-bit immediate field (bits 15 ... 0) passes through the sign extender, which produces a 32-bit value for the ALU.]
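
Sign extension copies bit 15 of the immediate into the upper 16 bits. A minimal sketch, assuming Python integers stand in for the wires; the function names are illustrative and do not appear in the slides.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def sign_extend16_bits(imm16):
    """Return the 32-bit pattern the sign extender would place on the bus."""
    return sign_extend16(imm16) & 0xFFFFFFFF


print(sign_extend16(0x0010))              # positive offset, as in LW R4,0x10(R2)
print(hex(sign_extend16_bits(0xFFFC)))    # negative offset: upper bits become 1s
```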

Store: From Register File to Memory


SW R4,0x10(R2)

[Diagram: as for the load, the ALU (set to Add) sums the base register and the 16-bit immediate sign-extended to 32 bits to form the Read/Write address; Read Data 2 from the Register File drives the Write Data input of the Data Memory (RAM).]

Branch
BEQ R1, R2, label

[Diagram: Read Data 1 and Read Data 2 from the Register File feed the ALU; the ALU's Zero output goes to the branch control logic.]

Need Adder to Compute Branch Target


BEQ R1, R2, label

[Diagram: a separate ADDER sums PC + 4 (from the instruction fetch datapath) and the 16-bit immediate, sign-extended to 32 bits and shifted left by 2, to produce the branch target; the ALU compares the two registers and sends its Zero output to the branch control logic.]
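
The branch-target adder can be written out as arithmetic. This is a sketch under the assumption that the immediate is a word offset (hence the shift left by 2, as in the diagram); `branch_target` and `next_pc` are illustrative names, not taken from the slides.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """Branch target = (PC + 4) + (sign-extended immediate << 2), wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def next_pc(pc, imm16, reg_a, reg_b):
    """BEQ: take the branch when the ALU's Zero output is set (reg_a - reg_b == 0)."""
    zero = (reg_a - reg_b) == 0
    return branch_target(pc, imm16) if zero else pc + 4


print(hex(next_pc(0x1000, 3, 7, 7)))   # taken: three words past PC + 4
print(hex(next_pc(0x1000, 3, 7, 8)))   # not taken: falls through to PC + 4
```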

Data Path for Memory and R-type Instructions

[Diagram: combined datapath for memory and R-type instructions. One MUX selects either Read Data 2 or the sign-extended 16-bit immediate as the second ALU input; a second MUX selects either the ALU result or the Data Memory (RAM) read data as the Register File Write Data.]

Complete Single-cycle Datapath


[Diagram: complete single-cycle datapath. The fetch unit (PC, Instruction Memory, adder for PC + 4), the branch-target ADDER fed by the immediate shifted left by 2, a MUX choosing the next PC, the Register File, the sign extender, the ALU with its Zero output, and the Data Memory (RAM) with MUXes on the ALU input and the register write data.]

What's Wrong with a Single-Cycle Implementation?

That was a single-cycle machine?
Yep! It was assumed that data flows through all parts of the datapath in ONE clock cycle

How long is a cycle?
ALU: 10 ns
Register File: 5 ns
Memory: 10 ns
Assume everything else takes zero time

Instruction Timings
Instr Type   InstrMem   Reg Read   ALU   DataMem   Reg Write   Total
R-format     10         5          10    -         5           30 ns
Load         10         5          10    10        5           40 ns
Store        10         5          10    10        -           35 ns
Branch       10         5          10    -         -           25 ns
Jump         10         -          -     -         -           10 ns

[Diagram: the single-cycle datapath (PC, Register File, sign extender, ALU, Data Memory, MUXes) through which each instruction type passes.]
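
The totals in the timing table follow from adding the delays of the units each instruction type uses. A minimal sketch: the unit delays are the slide's (memory 10 ns, register file 5 ns, ALU 10 ns), while the dictionary layout and names are assumptions for illustration.

```python
# Unit delays in nanoseconds, from the "How long is a cycle?" slide.
UNIT_NS = {"InstrMem": 10, "RegRead": 5, "ALU": 10, "DataMem": 10, "RegWrite": 5}

# Which units each instruction type passes through (per the timing table).
USES = {
    "R-format": ["InstrMem", "RegRead", "ALU", "RegWrite"],
    "Load":     ["InstrMem", "RegRead", "ALU", "DataMem", "RegWrite"],
    "Store":    ["InstrMem", "RegRead", "ALU", "DataMem"],
    "Branch":   ["InstrMem", "RegRead", "ALU"],
    "Jump":     ["InstrMem"],
}

totals = {kind: sum(UNIT_NS[u] for u in units) for kind, units in USES.items()}

# A single-cycle clock must fit the slowest instruction (the critical path).
cycle_ns = max(totals.values())
critical = max(totals, key=totals.get)

for kind, ns in totals.items():
    print(f"{kind:9s} {ns:3d} ns")
print(f"single-cycle clock period: {cycle_ns} ns (set by {critical})")
```

Every instruction, even a 10 ns jump, then pays the 40 ns load time.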

What's Wrong with a Single-Cycle Implementation?

Difficult to implement a variable-cycle clock
Usually run the clock at the speed of the SLOWEST instruction
The path that instruction takes is called the critical path
The critical path is the path through the system which limits performance

What if we add a floating point unit?
FP (floating point) math can take a very long time
100s of ns for multiply and divide
Lots of techniques to reduce time - will cover later on

How about breaking the machine into parts?

Multiple Cycle Implementation Datapath


[Diagram: multi-cycle datapath. A single Memory holds both instructions and data; the Instruction Register, MDR, A, B, and ALUOut registers hold values between cycles; MUXes select the memory address, the ALU inputs (including the constant 4 and the sign-extended immediate, optionally shifted left 2), and the next PC; the jump address is formed from PC[31:28] and the instruction's target field shifted left 2.]

Full Diagram of Multi-cycle Machine


Figure 5.33


Execution Steps (1)

Instruction Fetch

IR = Memory[PC];
PC = PC + 4;

Execution Steps (2)

Instruction Decode and Register Fetch

A = Reg[IR[25..21]];
B = Reg[IR[20..16]];
ALUOut = PC + (signExtend(IR[15..0]) << 2);

Execution Steps (3)

Execution, memory address computation or branch completion

Memory Reference
ALUOut = A + signExtend(IR[15..0]);

Arithmetic/Logical Operation
ALUOut = A + B;

Branch
if (A == B) PC = ALUOut;

Jump
PC = PC[31..28] || (IR[25..0] << 2);

Execution Steps (4)

Memory access or R-type instruction completion

Memory Reference
MDR = Memory[ALUOut];
or
Memory[ALUOut] = B;

Arithmetic/Logical Instructions (R-type)
Reg[IR[15..11]] = ALUOut;

Execution Steps (5)

Memory Read completion

Reg[IR[20..16]] = MDR;
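
The five execution steps can be traced in software. The following is a rough sketch, not the course's simulator: `run_one`, the dictionary used as memory, and the restriction of R-type to `add` (funct 0x20) are all assumptions for illustration. The opcodes used (0x23 lw, 0x2B sw, 0x04 beq, 0x02 j) are the standard MIPS values, and the returned cycle counts match the step at which each instruction class finishes.

```python
def sign_extend16(value):
    """Interpret the low 16 bits as a signed two's-complement value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_one(pc, regs, mem):
    """Run one instruction through the multi-cycle steps; return (new_pc, cycles)."""
    # Step 1: instruction fetch
    ir = mem[pc]
    pc = pc + 4
    # Step 2: decode, register fetch, speculative branch target
    op = ir >> 26
    a = regs[(ir >> 21) & 0x1F]
    b = regs[(ir >> 16) & 0x1F]
    imm = sign_extend16(ir)
    alu_out = pc + (imm << 2)
    # Step 3: execute, address computation, or branch/jump completion
    if op == 0x02:                                  # j
        return (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2), 3
    if op == 0x04:                                  # beq
        return (alu_out if a == b else pc), 3
    if op == 0x00:                                  # R-type (only add in this sketch)
        if (ir & 0x3F) != 0x20:
            raise ValueError("sketch only handles add")
        alu_out = (a + b) & 0xFFFFFFFF
        regs[(ir >> 11) & 0x1F] = alu_out           # Step 4: R-type completion
        return pc, 4
    alu_out = a + imm                               # lw/sw address
    if op == 0x2B:                                  # sw
        mem[alu_out] = b                            # Step 4: memory write
        return pc, 4
    if op == 0x23:                                  # lw
        mdr = mem[alu_out]                          # Step 4: memory read
        regs[(ir >> 16) & 0x1F] = mdr               # Step 5: write back
        return pc, 5
    raise ValueError(f"unsupported opcode {op:#x}")


# LW R4, 0x10(R2) with R2 = 100 and Memory[116] = 42
regs = [0] * 32
regs[2] = 100
mem = {0: (0x23 << 26) | (2 << 21) | (4 << 16) | 0x10, 116: 42}
new_pc, cycles = run_one(0, regs, mem)
print(new_pc, cycles, regs[4])
```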

Multicycle Control
[Diagram: the multi-cycle datapath annotated with its control signals: MemRead, MemWrite, IRWrite, IorD, RegWrite, RegDest, MemToReg, ALU SelA, ALU SelB, and ALU Op. The ALU Control block takes ALU Op and Instruction [5:0] (the funct field) and drives the ALU.]

Performance of Multicycle Implementation


Each type of instruction can take a variable # of cycles

Example
Assume the following instruction distributions:

Type       Cycles     Frequency
loads      5 cycles   22%
stores     4 cycles   11%
R-type     4 cycles   49%
branches   3 cycles   16%
jump       3 cycles   2%

What's the average Cycles Per Instruction (CPI)?
CPI = (CPU clock cycles / Instruction Count)
CPI = (5 cycles * 0.22) + (4 cycles * 0.11) + (4 cycles * 0.49) + (3 cycles * 0.16) + (3 cycles * 0.02)
CPI = 4.04 cycles per instruction

What was the CPI for the single-cycle machine?
Single cycle implies 1 clock cycle per instruction --> CPI = 1.0
So isn't the single-cycle machine faster?
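
The CPI is a weighted average, and the closing question can only be answered by also comparing clock periods. A minimal sketch: the instruction mix is the slide's, while the two clock periods are assumptions carried over from the earlier timing slide (40 ns for the single-cycle clock set by the load, 10 ns for the multi-cycle clock set by the slowest unit).

```python
# (cycles, fraction of instructions) per type, from the slide.
MIX = {
    "loads":    (5, 0.22),
    "stores":   (4, 0.11),
    "R-type":   (4, 0.49),
    "branches": (3, 0.16),
    "jump":     (3, 0.02),
}


def average_cpi(mix):
    """Weighted average of cycles per instruction."""
    return sum(cycles * fraction for cycles, fraction in mix.values())


cpi = average_cpi(MIX)

# Assumed clock periods (not stated on this slide).
SINGLE_CYCLE_CLOCK_NS = 40
MULTI_CYCLE_CLOCK_NS = 10

single_ns = 1.0 * SINGLE_CYCLE_CLOCK_NS
multi_ns = cpi * MULTI_CYCLE_CLOCK_NS

print(f"multi-cycle CPI: {cpi:.2f}")
print(f"average time per instruction: single-cycle {single_ns:.1f} ns, "
      f"multi-cycle {multi_ns:.1f} ns")
```

Under these assumed clocks the two designs come out nearly even, which illustrates that CPI alone does not decide which machine is faster: time per instruction is CPI times the clock period.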

CS/CoE 1541
Pipelining


Looks a Lot Like a Multicycle Processor


What are the steps?
Fetch an instruction
Decode the instruction
ALU OP
Memory Access
Write-back

[Diagram: the multi-cycle datapath (Memory, Instruction Register, Register File, sign extender, shift left 2, ALU, and MUXes), shown for comparison with the pipeline.]

Performance of Pipelined Systems


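[Diagram: instructions plotted against time for unpipelined and pipelined execution; in the pipelined case the instructions overlap, and the latency of one instruction is marked.]

Ideally, $\text{Time}_{pipelined} = \text{Time}_{sequential} / \text{Pipeline Depth}$ per instruction, i.e. $\text{Speedup}_{pipeline} = \text{Pipeline Depth}$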

MIPS Pipeline Stages

Stage 1: Instruction Fetch


Stage 2: Instruction Decode
Stage 3: Execute
Stage 4: Memory Access
Stage 5: Write Back (to register file)


How Do We Partition the Datapath into Stages


[Diagram: the single-cycle datapath divided into five stages: STAGE 1 Instruction Fetch (PC, adder, Instruction Memory), STAGE 2 Decode (Register File, sign extend), STAGE 3 ALU, STAGE 4 MemAcc (Data Memory), STAGE 5 Writeback (MUX back to the Register File).]

But How Do We Separate the Different Stages?

[Diagram: the same five stages (STAGE 1 Instruction Fetch, STAGE 2 Decode, STAGE 3 ALU, STAGE 4 MemAcc, STAGE 5 Writeback) with a bank of REGISTERS inserted between each pair of adjacent stages.]

Complete 5 Stage Pipeline


[Diagram: complete five-stage pipelined datapath, with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers between the stages.]

Flow of Instructions Through Pipeline


Program execution order versus time (clock cycles):

Instruction        CC1   CC2   CC3   CC4   CC5   CC6   CC7
LW R1, 100(R0)     IM    REG   ALU   DM    REG
LW R2, 200(R0)           IM    REG   ALU   DM    REG
LW R3, 300(R0)                 IM    REG   ALU   DM    REG

Stage 1 - IF (Instruction Fetch)


[Diagram: the five-stage pipelined datapath with the LW instruction in the Instruction Fetch stage.]

Stage 2 - ID (Instruction Decode)


[Diagram: the five-stage pipelined datapath with the LW instruction in the Instruction Decode stage.]

Stage 3 - EX (Execution)

[Diagram: the five-stage pipelined datapath with the LW instruction in the Execution stage.]

Stage 4 - MEM (Memory)


[Diagram: the five-stage pipelined datapath with the LW instruction in the Memory stage.]

Stage 5 - WB (Write Back)


[Diagram: the five-stage pipelined datapath with the LW instruction in the WriteBack stage.]

Clock Speed
If a single-cycle machine is broken into 2 pipeline stages, how much faster can the clock run?
Latency is the time from start to completion of an instruction

[Diagram: a single 100 ns block from Instructions to Result, and the same block split into two pipeline stages.]

How Far Can We Go?


Latency is the time from start to completion of an instruction

[Diagram: a single 100 ns block from Instructions to Result, then the same block split into progressively more pipeline stages.]

5 Stage Pipeline

[Diagram: the five-stage pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]

Pipeline Control

[Diagram: the pipelined datapath with control. The CONTROL block in the decode stage produces the EX, M, and WB control groups; they are carried through the ID/EX, EX/MEM, and MEM/WB pipeline registers, and each stage uses its own group. An ALU Control block drives the ALU.]

Flow of Instructions Through Pipeline


Program execution order versus time (clock cycles):

Instruction        CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
ADD R2,R3,R1       IM    REG   ALU   DM    REG
SUB R5,R6,R7             IM    REG   ALU   DM    REG
ADD R10,R11,R12                IM    REG   ALU   DM    REG

Contention at the Register File


Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
ADD R10, R11, R12    IM    REG   ALU   DM    REG
ADD R17, R0, R0            IM    REG   ALU   DM    REG
ADD R16, R0, R0                  IM    REG   ALU   DM    REG
SUB R20, R21, R22                      IM    REG   ALU   DM    REG
ADD R30, R17, R18                            IM    REG   ALU   DM

In cycle 5 the first ADD writes the Register File while SUB reads it.

Oops - Sometimes Results Are Not Ready


Program execution order versus time (clock cycles):

Instruction        CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
ADD R2,R3,R1       IM    REG   ALU   DM    REG
SUB R5,R6,R7             IM    REG   ALU   DM    REG
ADD R10,R11,R12                IM    REG   ALU   DM    REG
ADD R12,R10,R11                      IM    REG   ALU   DM    REG

ADD R10,R11,R12 writes its result back into R10 in cycle 7.
ADD R12,R10,R11 reads the value out of R10 in cycle 5.

Data Hazards
Programs assume instructions are executed sequentially, with one instruction completing before the next one begins
Usually the compiler assumes the single machine model

Pipelining violates this assumption
Dependencies can occur between instructions executing concurrently within the pipeline - if the dependencies are based on data requirements, we call them Data Hazards

Types of data hazards
Read-after-write (RAW)
    A true dependency
Write-after-read (WAR)
    Artificial dependency due to register assignment
Write-after-write (WAW)
    Artificial dependency due to register assignment

Coping with Data Hazards


Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
ADD R10, R11, R12    IM    REG   ALU   DM    REG
ADD R12, R10, R11          IM    REG   ALU   DM    REG
ADD R11, R10, R12                IM    REG   ALU   DM    REG

Solution 1 : Stall
[Diagram: pipeline timing over clock cycles 1 to 8. ADD R10, R11, R12 proceeds IM, REG, ALU, DM, REG in cycles 1 to 5. ADD R12, R10, R11 is fetched in cycle 2, then two bubbles are inserted so that its register read (REG) is delayed until R10 has been written back. ADD R11, R10, R12 is delayed behind it.]

Recall the Registers Between Pipeline Stages


[Diagram: the five-stage pipelined datapath, highlighting the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers between the stages.]

Stall Conditions
Need to detect data hazard
Occurs when one instruction tries to read a result from a previous instruction that hasn't completed yet.
Specifically,
When an instruction in the Execute stage tries to read a register that an instruction in the MemAcc or WB stage will write back to the Register File

H&P Notation
ID/EX.RegisterRs refers to the number of the first source register found in the pipeline register ID/EX.
ID/EX.RegisterRt refers to the number of the second source register found in the pipeline register ID/EX.

Recall What an Instruction Looks Like


add R8, R17, R18
is stored in binary format as
00000010 00110010 01000000 00100000

MIPS lays out instructions into fields

Bits     31-26    25-21   20-16   15-11   10-6    5-0
Value    000000   10001   10010   01000   00000   100000
Field    op       rs      rt      rd      shamt   funct

op      operation of the instruction
rs      first register source operand
rt      second register source operand
rd      register destination operand
shamt   shift amount
funct   function (select type of operation)

Remember the Registers In Between Each Stage

[Diagram: the five-stage pipelined datapath, highlighting the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]

Data Hazard Stall Conditions


1a. EX/MEM.RegisterRd == ID/EX.RegisterRs
1b. EX/MEM.RegisterRd == ID/EX.RegisterRt

[Diagram: the pipelined datapath with comparators (Rs =? Rd, Rt =? Rd) between the Rs and Rt fields held in ID/EX and the Rd field held in EX/MEM.]

Data Hazard Stall Conditions (cont)


1a. EX/MEM.RegisterRd == ID/EX.RegisterRs
1b. EX/MEM.RegisterRd == ID/EX.RegisterRt
2a. MEM/WB.RegisterRd == ID/EX.RegisterRs
2b. MEM/WB.RegisterRd == ID/EX.RegisterRt

[Diagram: the pipelined datapath with comparators (Rs =? Rd, Rt =? Rd) between the Rs and Rt fields held in ID/EX and the Rd fields held in both EX/MEM and MEM/WB.]

Data Hazard Logic


[Diagram: the pipelined datapath with a Data Hazard Logic block that evaluates Rs =? Rd and Rt =? Rd between the ID/EX, EX/MEM, and MEM/WB stages, using the Rs, Rt, and Rd fields from ID/EX and the Rd fields from EX/MEM and MEM/WB.]

Example
Instruction          Rd     Rs    Rt
sub R2, R1, R3       R2     R1    R3
and R12, R2, R5      R12    R2    R5
or  R13, R6, R2      R13    R6    R2
add R14, R2, R2      R14    R2    R2
sw  R15, 100(R2)     R15    R2    XX

SUB-AND Hazard
EX/MEM.RegisterRd == ID/EX.RegisterRs == R2

SUB-OR Hazard
MEM/WB.RegisterRd == ID/EX.RegisterRt == R2

Do we care about the interaction between sub (instruction 1) and add (instruction 4)?
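
The four comparisons (1a, 1b, 2a, 2b) can be applied mechanically to this instruction sequence. This is an illustrative sketch only: the `Instr` tuple and the `hazards` function are hypothetical names, and the model simply assumes that when an instruction is in ID/EX, the previous instruction is in EX/MEM and the one before that is in MEM/WB.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    text: str
    rd: Optional[int]   # destination register number (None if none)
    rs: Optional[int]   # first source register number
    rt: Optional[int]   # second source register number


def hazards(id_ex, ex_mem, mem_wb):
    """Return the labels (1a, 1b, 2a, 2b) of the comparisons that match."""
    found = []
    for label, older in (("1", ex_mem), ("2", mem_wb)):
        if older is None or older.rd is None:
            continue
        if older.rd == id_ex.rs:
            found.append(label + "a")   # ...RegisterRd == ID/EX.RegisterRs
        if older.rd == id_ex.rt:
            found.append(label + "b")   # ...RegisterRd == ID/EX.RegisterRt
    return found


program = [
    Instr("sub R2, R1, R3", rd=2, rs=1, rt=3),
    Instr("and R12, R2, R5", rd=12, rs=2, rt=5),
    Instr("or R13, R6, R2", rd=13, rs=6, rt=2),
    Instr("add R14, R2, R2", rd=14, rs=2, rt=2),
]

report = {}
for i, instr in enumerate(program):
    ex_mem = program[i - 1] if i >= 1 else None
    mem_wb = program[i - 2] if i >= 2 else None
    report[instr.text] = hazards(instr, ex_mem, mem_wb)
    print(f"{instr.text:18s} {report[instr.text]}")
```

The `add` (instruction 4) triggers none of the four comparisons: by the time it is in ID/EX, `sub` has already left the MEM/WB register.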

Example (cont)
[Diagram: the pipelined datapath with the Data Hazard Logic block. SUB R2, R1, R3 is in IF/ID; the Rs, Rt, and Rd fields of ID/EX and the Rd fields of EX/MEM and MEM/WB are still empty.]

Example (cont)
[Diagram: the pipelined datapath with the Data Hazard Logic block. AND R12, R2, R5 is in IF/ID and SUB R2, R1, R3 is in ID/EX, so ID/EX holds Rs = R1, Rt = R3, Rd = R2; the Rd fields of EX/MEM and MEM/WB are still empty.]

Example (cont)
Data Hazard Logic
EX/MEM.RegisterRd = R2 == ID/EX.RegisterRs = R2
EX/MEM.RegisterRd = R2 != ID/EX.RegisterRt = R5

[Diagram: the pipelined datapath with the Data Hazard Logic block. OR R13, R6, R2 is in IF/ID; AND R12, R2, R5 is in ID/EX (Rs = R2, Rt = R5, Rd = R12); SUB R2, R1, R3 is in EX/MEM (Rd = R2); the Rd field of MEM/WB is still empty.]

Example (cont)
Data Hazard Logic
EX/MEM.RegisterRd = R12 != ID/EX.RegisterRs = R6
EX/MEM.RegisterRd = R12 != ID/EX.RegisterRt = R2
MEM/WB.RegisterRd = R2 != ID/EX.RegisterRs = R6
MEM/WB.RegisterRd = R2 == ID/EX.RegisterRt = R2

[Diagram: the pipelined datapath with the Data Hazard Logic block. ADD R14, R2, R2 is in IF/ID; OR R13, R6, R2 is in ID/EX (Rs = R6, Rt = R2, Rd = R13); AND R12, R2, R5 is in EX/MEM (Rd = R12); SUB R2, R1, R3 is in MEM/WB (Rd = R2).]

No Dependence Between Instruction 1 and 4


Program execution order versus time (clock cycles):

Instruction        CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
SUB R2, R1, R3     IM    REG   ALU   DM    REG
AND R12, R2, R5          IM    REG   ALU   DM    REG
OR R13, R6, R2                 IM    REG   ALU   DM    REG
ADD R14, R2, R2                      IM    REG   ALU   DM    REG

How Do We Stall the Pipeline?


Compiler can insert nops

Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
ADD R10, R11, R12    IM    REG   ALU   DM    REG
nop                        IM    REG   ALU   DM    REG
nop                              IM    REG   ALU   DM    REG
ADD R12, R10, R11                      IM    REG   ALU   DM    REG

Hardware Can Simulate NOPS


Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3      CC4      CC5      CC6      CC7      CC8
ADD R10, R11, R12    IM    REG   ALU      DM       REG
stall                      IM    bubble   bubble   bubble   bubble
stall                            IM       bubble   bubble   bubble   bubble
ADD R12, R10, R11                         IM       REG      ALU      DM       REG

Reducing Data Hazards: Forwarding


Data may be already computed - just not in the Register File

Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
ADD R10, R11, R12    IM    REG   ALU   DM    REG
ADD R12, R10, R11          IM    REG   ALU   DM    REG

Additions to the Datapath for Forwarding


[Diagram: the pipelined datapath with ADD R10, R11, R12 one stage ahead of ADD R12, R11, R10; an added path and an extra MUX at the ALU input forward the earlier instruction's ALU result to the later instruction.]

Forwarding Continued
Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
ADD R10, R11, R12    IM    REG   ALU   DM    REG
ADD R12, R10, R11          IM    REG   ALU   DM    REG
ADD R4, R5, R10                  IM    REG   ALU   DM    REG

More Additions to the Datapath


[Diagram: the pipelined datapath with ADD R4, R5, R10 following ADD R12, R11, R10 and ADD R10, R11, R12; a second forwarding path brings the result held one stage further along back to the MUXes at the ALU inputs.]

Forwarding Doesn't Always Work

Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
LW R10, 0x00(R4)     IM    REG   ALU   DM    REG
ADD R12, R10, R11          IM    REG   ALU   DM    REG

The loaded value leaves the Data Memory (DM) in cycle 4, the same cycle in which the ADD already needs it at the ALU.

Loads and Stores Require a Load Delay Slot


Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
LW R10, 0x00(R4)     IM    REG   ALU   DM    REG
nop                        IM    REG   ALU   DM    REG
ADD R12, R10, R11                IM    REG   ALU   DM    REG

MIPS Load-Delay Slot


MIPS exposed the load-delay slot to the compiler
This makes it part of the architecture, not just an implementation detail
Therefore, it's up to the compiler (or assembly code writer) to make sure that the instruction after a load does not depend on the result of the load

An alternative would have been to force the hardware to detect the data hazard and stall the pipeline
Most of today's architectures detect the hazard and stall

3 Types of Data Hazards


Read-after-write (RAW)
a true dependency
Example
    ADD R1, R2, R3
    SUB R6, R7, R1

Write-after-read (WAR)
artificial dependency due to register assignment
Example
    LW R1, 0(R2)
    ADD R2, R6, R3

Write-after-write (WAW)
artificial dependency due to register assignment
Example
    LW R1, 0(R2)
    ADD R1, R3, R4
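
The three hazard types can be told apart by comparing the destination and source registers of two instructions in program order. A small sketch, assuming each instruction is reduced to a (destination, sources) pair; the `classify` function and this representation are illustrative, not from the slides.

```python
def classify(first, second):
    """Classify dependencies of `second` on `first` (first is earlier in program order).

    Each instruction is a (dest, sources) pair of register names.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = []
    if dest1 in srcs2:
        kinds.append("RAW")   # second reads what first writes (true dependency)
    if dest2 in srcs1:
        kinds.append("WAR")   # second overwrites a register first still reads
    if dest1 == dest2:
        kinds.append("WAW")   # both write the same register
    return kinds


# The three example pairs from the slide.
raw = classify(("R1", ("R2", "R3")), ("R6", ("R7", "R1")))   # ADD R1,R2,R3 ; SUB R6,R7,R1
war = classify(("R1", ("R2",)), ("R2", ("R6", "R3")))        # LW R1,0(R2)  ; ADD R2,R6,R3
waw = classify(("R1", ("R2",)), ("R1", ("R3", "R4")))        # LW R1,0(R2)  ; ADD R1,R3,R4
print(raw, war, waw)
```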

Taxonomy of Hazards
Data Hazards are just one type of hazard that can occur in a machine. There are actually 3 basic types of hazards

Hazard Taxonomy
Data hazards
    Instruction depends on result of prior computation which is not ready yet
Structural hazards
    HW cannot support a combination of instructions
Control hazards
    Pipelining of branches and other instructions which change the PC

Structural Hazards
Structural hazards
HW cannot support a combination of instructions
Occurs when two or more instructions want to use the same hardware resource in the same cycle
Causes bubble (stall) in pipelined machines
Overcome by replicating hardware resources

Examples
Multiple accesses to the register file
Branch adder and ALU
Multiple accesses to memory

Structural Hazard Example 1

Without the separate adder, both the address computation and the arithmetic computation would require access to the ALU in the same cycle
    beq r1, r2, offset    ; if r1 == r2, then PC <-- PC + offset

[Diagram: the five-stage pipelined datapath, showing the branch-target ADDER alongside the ALU.]

Structural Hazard Example 2


Two instructions need access to memory in Clock Cycle 4.
If there is only one memory port, then only one instruction can read/write memory at a time

Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
LW R2, 0x10(R4)      IM    REG   ALU   DM    REG
SUB R5,R6,R7               IM    REG   ALU   DM    REG
ADD R10,R11,R12                  IM    REG   ALU   DM    REG
ADD R12, R10, R11                      IM    REG   ALU   DM    REG

Structural Example 2 (cont)

Two instructions need access to memory in Clock Cycle 4.
If there is only one memory port, then only one instruction can read/write memory at a time

Program execution order versus time (clock cycles):

Instruction          CC1   CC2   CC3   CC4      CC5      CC6      CC7      CC8
LW R2, 0x10(R4)      IM    REG   ALU   DM       REG
SUB R5,R6,R7               IM    REG   ALU      DM       REG
ADD R10,R11,R12                  IM    REG      ALU      DM       REG
Stall                                  bubble   bubble   bubble   bubble   bubble
ADD R12, R10, R11                               IM       REG      ALU      DM

Control Hazards - Branches


Example code

Address   Instruction
36        NOP
40        ADD R30,R30,R30
44        BEQ R1, R3, 24      <- this branches to address 72
48        AND R12, R2, R5
52        OR R13, R6, R2
56        ADD R14, R2, R2
60        ...
64        ...
68        ...
72        LW R4, 50(R7)
76        ...

Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...

Branch Hazards
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...

Instruction                        CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
44 BEQ R1, R3, 24                  IM    REG   ALU   DM    REG
48 AND R12, R2, R5                       IM    REG   ALU   DM    REG
52 OR R13, R6, R2                              IM    REG   ALU   DM    REG
56 ADD R14, R2, R2                                   IM    REG   ALU   DM    REG
60 or 72 (depending on branch)                             IM    REG   ALU   DM    REG

Always Stalling hurts the No-branch case


Flow of instructions if branch is not taken: 36, 40, 44, 48, ...

Instruction          CC1   CC2   CC3      CC4      CC5      CC6      CC7      CC8   CC9
44 BEQ R1, R3, 24    IM    REG   ALU      DM       REG
stall                      IM    bubble   bubble   bubble   bubble
stall                            IM       bubble   bubble   bubble   bubble
stall                                     IM       bubble   bubble   bubble   bubble
48 AND R12, R2, R5                                 IM       REG      ALU      DM    REG

Solution: Assume Branch Not Taken

Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...

Instruction                                   CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
44 BEQ R1, R3, 24                             IM    REG   ALU   DM    REG
48 AND R12, R2, R5                                  IM    REG   ALU   DM    REG
52 OR R13, R6, R2                                         IM    REG   ALU   DM    REG
56 ADD R14, R2, R2                                              IM    REG   ALU   DM    REG
60 or 72 (depending on outcome of branch)                             IM    REG   ALU   DM    REG

What Happens When the Branch IS Taken


Flow of instructions if branch is taken: 36, 40, 44, 72, ...

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
44 BEQ R1, R3, 24    IM    REG   ALU   DM    REG
48 AND R12, R2, R5         IM    REG   ALU   DM    REG
52 OR R13, R6, R2                IM    REG   ALU   DM    REG
56 ADD R14, R2, R2                     IM    REG   ALU   DM    REG
72 LW R4, 50(R7)                             IM    REG   ALU   DM    REG

The instructions at 48, 52, and 56 are already in the pipeline when the LW at the branch target (72) is fetched in cycle 5.

Move the Branch Computation Forward


[Diagram: the five-stage pipelined datapath with the branch-target ADDER and the next-PC selection moved to an earlier stage, so that the branch is resolved one cycle sooner.]

Branch with New Datapath


Reducing the penalty by 1 cycle

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
44 BEQ R1, R3, 24    IM    REG   ALU   DM    REG
48 AND R12, R2, R5         IM    REG   ALU   DM    REG
52 OR R13, R6, R2                IM    REG   ALU   DM    REG
72 LW R4, 50(R7)                       IM    REG   ALU   DM    REG

Move the Branch Computation Further Forward


[Diagram: the five-stage pipelined datapath with the branch-target ADDER and a dedicated Compare unit placed in the decode stage, right after IF/ID; the Compare output controls the MUX select for the next PC.]

Another New and Improved Datapath


Voila - the branch delay slot

Instruction          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
44 BEQ R1, R3, 24    IM    REG   ALU   DM    REG
48 AND R12, R2, R5         IM    REG   ALU   DM    REG
72 LW R4, 50(R7)                 IM    REG   ALU   DM    REG

Rewriting the Code for a Branch Delay Slot


Without Branch Delay Slot

Address   Instruction
36        NOP
40        ADD R30,R30,R30
44        BEQ R1, R3, 24
48        AND R12, R2, R5
52        OR R13, R6, R2
56        ADD R14, R2, R2
60        ...
64        ...
68        ...
72        LW R4, 50(R7)
76        ...

With Branch Delay Slot

Address   Instruction
36        NOP
40        BEQ R1, R3, 28
44        ADD R30, R30, R30
48        AND R12, R2, R5
52        OR R13, R6, R2
56        ADD R14, R2, R2
60        ...
64        ...
68        ...
72        LW R4, 50(R7)
76        ...

Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
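
The two listings give the same instruction flow, which can be checked with a small architectural model. This is a sketch under stated assumptions: `fetch_order` is a hypothetical function, the branch offset is treated as a byte offset added to PC + 4 (which is how the slide's numbers work out: 44 + 4 + 24 = 72 and 40 + 4 + 28 = 72), and pipeline timing is ignored.

```python
def fetch_order(start, branch_addr, byte_offset, taken, delay_slot, count):
    """Return the first `count` instruction addresses executed.

    With delay_slot=True the instruction right after the branch always
    executes before control transfers to the target.
    """
    order = []
    pc = start
    pending_target = None           # target waiting behind a delay slot
    while len(order) < count:
        order.append(pc)
        if pending_target is not None:
            # The instruction just recorded was the delay slot.
            pc, pending_target = pending_target, None
        elif pc == branch_addr and taken:
            target = pc + 4 + byte_offset
            if delay_slot:
                pending_target = target
                pc += 4
            else:
                pc = target
        else:
            pc += 4
    return order


# Original code: BEQ at 44 with offset 24, no delay slot.
print(fetch_order(36, 44, 24, taken=True, delay_slot=False, count=4))
# Rewritten code: BEQ moved to 40 with offset 28, ADD now sits in the delay slot at 44.
print(fetch_order(36, 40, 28, taken=True, delay_slot=True, count=4))
```

Moving the ADD into the delay slot is legal here because the branch compares R1 and R3 and the ADD only touches R30.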

Performance of Pipelined Systems


Stalls due to data and branch hazards make performance less than one instruction per cycle
Compiler is critical in determining overall performance
Compiler generates code that avoids stalls

Example
    lw R15, 0x00(R2)
    add R14, R15, R15
    lw R16, 0x04(R2)
Might become:
    lw R15, 0x00(R2)
    lw R16, 0x04(R2)
    add R14, R15, R15
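
The benefit of the reordering can be counted. A minimal sketch, assuming a pipeline with forwarding in which the only remaining stall is one bubble when an instruction uses the result of the load immediately before it; the tuple representation and `load_use_stalls` are illustrative names.

```python
def load_use_stalls(program):
    """Count one-cycle bubbles caused by using a loaded register in the very next instruction.

    Each instruction is (opcode, dest, sources).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


original = [
    ("lw",  "R15", ("R2",)),
    ("add", "R14", ("R15", "R15")),
    ("lw",  "R16", ("R2",)),
]
scheduled = [
    ("lw",  "R15", ("R2",)),
    ("lw",  "R16", ("R2",)),
    ("add", "R14", ("R15", "R15")),
]

print(load_use_stalls(original), load_use_stalls(scheduled))
```

The second load is independent of the add, so hoisting it fills the slot after the first load without changing the program's result.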

Performance of Pipelined Systems


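[Diagram: instructions plotted against time for unpipelined and pipelined execution; in the pipelined case the instructions overlap, and the latency of one instruction is marked.]

Ideally, $\text{Time}_{pipelined} = \text{Time}_{sequential} / \text{Pipeline Depth}$ per instruction, i.e. $\text{Throughput}_{pipeline} = \text{Pipeline Depth} / \text{Time}_{sequential}$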

Pipeline Speedup and Throughput


Assume instruction execution takes n stages
Stages $s_1, s_2, \ldots, s_n$ take time $t_1, t_2, \ldots, t_n$

Without pipelining
Throughput $= 1 / \sum_{i=1}^{n} t_i$
Latency $= 1 / \text{throughput} = \sum_{i=1}^{n} t_i$

With pipelining
Throughput $= 1 / \max_i t_i \le n / \sum_{i=1}^{n} t_i$
Latency $= n / \text{throughput} = n \cdot \max_i t_i$
Speedup $= \sum_{i=1}^{n} t_i / \max_i t_i \le n$
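
These formulas can be evaluated for concrete stage times. A short sketch: `pipeline_metrics` is an illustrative name, and the stage times in the example (10, 5, 10, 10, 5 ns) are borrowed from the earlier unit delays as an assumption about how the five stages might be timed.

```python
def pipeline_metrics(stage_times):
    """Evaluate throughput, latency, and speedup for the given per-stage times."""
    n = len(stage_times)
    total = sum(stage_times)        # time for one instruction without pipelining
    slowest = max(stage_times)      # pipelined clock period
    return {
        "unpipelined_throughput": 1 / total,
        "unpipelined_latency": total,
        "pipelined_throughput": 1 / slowest,
        "pipelined_latency": n * slowest,
        "speedup": total / slowest,
    }


unbalanced = pipeline_metrics([10, 5, 10, 10, 5])
balanced = pipeline_metrics([8, 8, 8, 8, 8])

print(unbalanced["speedup"], unbalanced["pipelined_latency"])
print(balanced["speedup"], balanced["pipelined_latency"])
```

With unbalanced stages the speedup falls short of n and the latency of a single instruction grows (50 ns instead of 40 ns); only perfectly balanced stages reach the ideal speedup of n.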

What Makes Pipelines Hard to Implement?


Detecting and resolving hazards
Exceptions and Interrupts
Instruction Set Architecture
CISC instructions are difficult to pipeline
Example:
stringMov from 0x1234, to 0x4000, 0x1000 bytes
