CENG 450 Lab Project Report
CENG 450
Computer Systems and Architecture
Spring 2018
2 Considerations
2.1 Design Requirements
2.2 Project Time Line
2.3 Pipeline Hazards
2.3.1 Structural Hazards
2.3.2 Control Hazards
2.3.3 Data Hazards
3 Pipeline Components
3.1 ROM Module
3.2 Program Counter
3.3 Register File
3.4 ALU
3.5 RAM
4 Pipeline
4.1 Control Unit
4.2 Fetch
4.3 Decode
4.4 Execute
4.5 Memory Access
4.6 Write Back
4.7 ALU
6 Observations
7 Performance
7.1 Testing Methodology
7.2 Results
7.2.1 Critical Path
7.2.2 CPI
7.2.3 Clock Rate
7.2.4 Simulations
8 Contributions
9 Conclusion
10 Recommendation
11 References
12 Appendices
List of Figures
1 CPU System Schematic
2 ROM Diagram
3 CPU System Diagram
4 Control Unit FSM Flow Chart
5 Fetch Block Diagram
6 Hazard detection FIFO and valid bits
7 Testing System Schematic
8 Test 1 time 1
List of Tables
1 Pipeline Structural Hazard Example
2 Pipeline Control Hazard Example
3 Pipeline Data Hazard Example
4 Control Unit Outputs to Pipeline Stages
5 Control Unit Outputs to Pipeline Stages cont’d
6 Control Unit FSM State Table
7 ALU instruction after ALU instruction hazard
8 IN instruction hazard
9 LOADIMM hazard
10 MOV hazard
11 TEST branch instruction hazard
12 Store after ALU instruction hazard
13 Store after load hazard
1 Introduction
The purpose of this lab project was to design and implement a 16-bit CPU for a specified
instruction set architecture using a Xilinx Spartan-3E family FPGA. The CPU design was to
be implemented by programming the FPGA in VHDL using the Xilinx ISE. The project
was introduced on January 24th, 2018 and was to be completed by April 5th, 2018.
2 Considerations
Given the complexity of this project, it is important to outline the criteria and challenges
that shaped the design approach. This section covers the design requirements, the project
time line, and a brief discussion of the inherent design challenges.
Final CPU designs are tested using the provided format tests, as well as 3 final test codes,
which are traditional programs employing all instruction formats.
2.3 Pipeline Hazards
Pipeline hazards are the most difficult design challenge when implementing a pipelined
CPU. Hazards arise from the pipelined operation itself, so managing them is an essential
part of designing a pipelined CPU. These hazards fall into three categories: structural,
control, and data.
Structural hazards do not exist within this CPU design because the fetch and memory
access stages of the pipeline use separate components: a ROM component stores the
program and a RAM module serves as the data memory. In designs where both stages share
a single memory, a popular method of circumventing this hazard is to employ a dual-ported
memory module.
2.3.2 Control Hazards
Control hazards arise from pipelined branching operations and other operations that
change the program counter. The easiest method of handling pipelined branching operations
is to implement a stalling mechanism. The purpose of this stalling, or "bubbling",
mechanism is to allow the conditional branch to be resolved so that the newly computed
destination can be loaded into the program counter. Table 2 illustrates a stalling
mechanism. As a branch instruction is decoded, the successive instruction is stalled and the
branch instruction carries through the pipeline to be resolved. Once the branch is resolved,
the program counter is updated and the pipeline continues.
A method used to solve this common RAW (read-after-write) scenario is operand forwarding.
This technique derives from the observation that the newly computed value for R1 is
available immediately after the execute stage. Control logic detects this hazard by checking
whether the previous instruction's destination is the same as one of the current
instruction's sources. If so, the previous instruction's execution result is fed back into
the current instruction's execute stage. Additional multiplexing logic selects the forwarded
operand instead of the stale register file value, maintaining data integrity.
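The detection-and-select step described above can be sketched in Python as a behavioural model. This is only an illustration, not the VHDL used in the project: the `Instr` record, the `select_operand` function, the eight-entry register file, and the example ADD/SUB pair are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    op: str                 # mnemonic, for readability only
    dest: Optional[int]     # register index written, or None
    srcs: Tuple[int, ...]   # register indices read


def select_operand(reg, regfile, prev, prev_result):
    """Pick the forwarded result when the previous instruction writes `reg`."""
    if prev is not None and prev.dest is not None and prev.dest == reg:
        return prev_result          # RAW hazard: bypass the register file
    return regfile[reg]             # no hazard: normal register read


# Hypothetical register file (assumed 8 registers) and instruction pair.
regfile = [0, 5, 7, 0, 0, 0, 0, 0]
add = Instr("ADD", dest=1, srcs=(1, 2))        # R1 <- R1 + R2
add_result = regfile[1] + regfile[2]           # computed in execute, not yet written back
sub = Instr("SUB", dest=3, srcs=(1, 2))        # reads R1 right after ADD

operand_a = select_operand(sub.srcs[0], regfile, add, add_result)
operand_b = select_operand(sub.srcs[1], regfile, add, add_result)
```

Here `operand_a` takes the forwarded ADD result while `operand_b` is read from the register file, since only R1 is written by the preceding instruction.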
3 Pipeline Components
The pipelined CPU is composed of a number of components that carry out each stage's
function. This section does not cover stage-specific controllers and multiplexers; it
focuses on the functional units.
The ROM module is found in the fetch stage of the pipeline. The ROM receives an address
from the program counter and outputs the corresponding instruction. The ROM array is
byte addressable and 2 KB in size.
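As a rough illustration of fetching from a byte-addressable 2 KB ROM, the Python sketch below assumes 16-bit instructions stored big-endian at even byte addresses. The instruction width, byte order, alignment rule, and the names `ROM_SIZE_BYTES` and `fetch` are assumptions made for this example and are not taken from the project's VHDL.

```python
ROM_SIZE_BYTES = 2 * 1024  # 2 KB, byte addressable (from the text)


def fetch(rom, pc):
    """Return the 16-bit word at byte address pc (assumed big-endian, even-aligned)."""
    if pc % 2 != 0:
        raise ValueError("assumed: instructions start on even byte addresses")
    if not 0 <= pc <= ROM_SIZE_BYTES - 2:
        raise ValueError("program counter outside the 2 KB ROM")
    return (rom[pc] << 8) | rom[pc + 1]


rom = bytearray(ROM_SIZE_BYTES)
rom[4], rom[5] = 0x12, 0x34      # hypothetical instruction word at byte address 4
word = fetch(rom, 4)
```

Under these assumptions the ROM holds 1024 instruction words and the program counter advances by 2 per sequential fetch.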
3.4 ALU
3.5 RAM
4 Pipeline
4.1 Control Unit
The Control Unit is a finite state machine (FSM) that interacts with all pipeline stages.
The next state of the FSM is determined by the previous state and a signal from the decode
stage. The Control Unit FSM consists of several internal states and 4 output states. The
internal states govern the sequence and selection of output state signals. The primary
function of the Control Unit is to insert "bubbles" to enable branch handling.
The flow chart in Figure 4 illustrates the FSM internal state sequence. Following
a Control Unit RESET state, a RUN state is always engaged. The RUN state is only
disengaged when
Tables 4 and 5 show the resulting Control Unit outputs to the pipeline stages for each of
the Control Unit internal states seen in Figure 4:
Table 6 shows all possible states of the stage FSM and, for each state, the effect on the
RAM and Register File, the Program Counter value, and the stage output. When the RESET
state is asserted, all data in all the stages is cleared and the PC value is set to zero.
The RUN state operates the stages normally. The STALL state retains the current values
inside the stages as well as their outputs. The WRITE PC state acts identically to STALL,
except that the PC value is updated if the branch is taken.
Table 6: Control Unit FSM State Table

                                  Stage State
Component      RESET      RUN           STALL               WRITE PC
Mem. & Reg.    Set to 0.  Run normally  Keep current value  Keep current value
PC Value       Set to 0.  Run normally  Keep current value  Update PC if branch taken
Stage Output   Set to 0.  Run normally  Keep current value  Output previous value
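The four stage states can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions rather than the project's VHDL: the `State` enum, the `step_stage` function, and its argument names are hypothetical, and only the PC and stage-output rows of the table are modelled.

```python
from enum import Enum


class State(Enum):
    RESET = 0
    RUN = 1
    STALL = 2
    WRITE_PC = 3


def step_stage(state, pc, stage_out, next_pc, next_out,
               branch_taken=False, branch_target=0):
    """Return the (pc, stage_out) pair held after one clock edge."""
    if state is State.RESET:
        return 0, 0                      # everything cleared
    if state is State.RUN:
        return next_pc, next_out         # normal operation
    if state is State.STALL:
        return pc, stage_out             # hold current values
    # WRITE_PC: behaves like STALL, except the PC is updated on a taken branch.
    return (branch_target if branch_taken else pc), stage_out


held = step_stage(State.STALL, pc=8, stage_out=0xAB, next_pc=10, next_out=0xCD)
jumped = step_stage(State.WRITE_PC, pc=8, stage_out=0xAB, next_pc=10,
                    next_out=0xCD, branch_taken=True, branch_target=40)
```

In this model `held` keeps both the PC and the stage output, while `jumped` keeps the stage output but loads the hypothetical branch target into the PC.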
4.2 Fetch
4.3 Decode
4.4 Execute
4.5 Memory Access
4.6 Write Back
4.7 ALU
Figure 6 contains a hazard detection example. When the write instruction is clocked
in at t = 10 ns, the write destination is stored in the write destination FIFO (hazard
destinations). The first bit of the validity indicator vector (hazard writes) is set to
1. All the other bits of the validity indicator vector are set to 0. At t = 30 ns, a NOP
instruction is clocked in. Nothing new is written to the FIFO; however, the previous write
destination moves through the FIFO. At t = 50 ns, a new write instruction is clocked in.
The write destination FIFO and validity indicator vector are updated with the new data.
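The FIFO-plus-valid-bits scheme can be sketched as a small Python model. The class name `HazardTracker`, the FIFO depth of 3, and the `is_hazard` lookup are assumptions made for illustration; the report does not state the depth or how the comparison is wired in the VHDL.

```python
class HazardTracker:
    """Shift-register model of the write destination FIFO and its valid bits."""

    def __init__(self, depth=3):            # depth is an assumed value
        self.destinations = [0] * depth     # write destination FIFO
        self.valid = [0] * depth            # validity indicator vector

    def clock(self, writes, dest=0):
        """Shift older entries one slot and record the newly clocked instruction."""
        self.destinations = [dest if writes else 0] + self.destinations[:-1]
        self.valid = [1 if writes else 0] + self.valid[:-1]

    def is_hazard(self, src):
        """True if an in-flight instruction will write the register `src`."""
        return any(v and d == src
                   for d, v in zip(self.destinations, self.valid))


tracker = HazardTracker()
tracker.clock(writes=True, dest=1)    # write instruction targeting R1
after_write = list(tracker.valid)
tracker.clock(writes=False)           # NOP: nothing new, old entry shifts along
after_nop = list(tracker.valid)
tracker.clock(writes=True, dest=2)    # new write instruction targeting R2
```

A NOP contributes a cleared valid bit, so a register number of 0 left in an unused slot is never mistaken for a pending write to R0.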
Another hazard arises for any instruction that follows the IN instruction. The hazard is
shown in Table 8. At clock 4, the memory access stage (instruction 1) latches the IN data,
and R1's value becomes the IN data. Also at clock 4, the IN data is forwarded to the
execute stage (instruction 2). At clock 5, R1's value is forwarded to instruction 3 and
instruction 4. This resolves the IN data hazard.
Table 9 shows the LOADIMM hazard. The hazard occurs when any other instruction
follows the LOADIMM instruction. In clock 2, R7's value is retrieved and the upper
bits are updated to the value 9. In clock 3, R7's value is forwarded from the execute stage
(instruction 1) to the decode stage (instruction 2). Also in clock 3, the lower bits
of R7 are updated to the value 4. The muxes in the decode stage handle the forwarding.
In clock 4, the correct R7 value is forwarded again from the execute stage (instruction 2)
to the decode stage (instruction 3). The forwarding resolves all hazards.
Table 10 shows the MOV hazard. The hazard occurs when any other instruction
follows the MOV instruction. At clock 2, R2's value is retrieved. At clock 3, R2's
value is forwarded from the execute stage (instruction 1) to the decode stage
(instruction 2). At clock 4, R2's value is available to the ALU.
Table 11 shows the TEST branch instruction hazard. The hazard occurs when a
combination of ALU instructions, TEST instructions, and branch instructions occur in
succession. At clock 4, R0's value is computed. At clock 5, R0's value is forwarded from
the memory access stage (instruction 2) to the execute stage (instruction 3). Also in
clock 5, the branch instruction is stalled by one cycle. In clock 6, R1's value is forwarded
from the register file to the decode stage of the branch instruction, and the negative and
zero flags are forwarded from the TEST instruction to the branch instruction. The branch
destination is computed in clock 7.
Table 12 shows the store after ALU instruction hazard. At clock 3, R1's value is computed
by the ALU. At clock 4, the decode stage of the store instruction captures R1's value. Also
in clock 4, R0's value is computed. In clock 5, R0's value is captured by a special
FIFO in the memory access stage. At clock 6, R0's value and R1's value are available to
the memory access stage. This allows the store instruction to store R1's value at the memory
location given by R0's value. A similar hazard mitigation procedure is used for the load
after ALU hazard.
The second hazard is a store instruction after a load instruction. If the load instruction
loads a value into a register and the store instruction uses that register as its memory
write address, then a hazard occurs. The hazard is mitigated by storing the loaded
value in a special FIFO in the memory access stage. When the store instruction reaches the
memory access stage, the value is retrieved from the special FIFO.
Table 13 shows an example of the store after load hazard. At clock 4, the loaded R1
value is written into the special FIFO. At clock 5, R1's value is retrieved from the special
FIFO. R4's value is then stored at the memory location given by R1's value.
5.3 Multiplication
Table 13: Store after load hazard.

                 Clock Cycle Number
Instruction      1    2    3    4    5    6    7
LOAD R1, R0      IF   D    EX   M    WB
STORE R4, R1          IF   D    EX   M    WB
6 Observations
7 Performance
7.1 Testing Methodology
The Final Tests 1-3 ROM files provided by Lab TA Ibrahim Hazmi are used as the CPU
test code. Additionally, Ibrahim's 7-Segment Display Controller is used to display
CPU output data in hexadecimal format. The 7-segment display is integrated into the
FPGA evaluation board. CPU input data is controlled via the mapped FPGA evaluation
board SPST switches. A function generator outputting a 3.3 V square wave with variable
frequency is used as the CPU clock signal. The CPU reset signal is mapped to an
FPGA evaluation board push switch. The testing system schematic is illustrated in Figure
7.
7.2 Results
7.2.1 Critical Path
The critical path is the execute stage of the pipeline. The execute stage contains the
ALU, 4 muxes, forwarding logic, and a direct path from the register file. The multiplier is
the slowest component of the ALU. The multiplier uses the DSP48A block without any
pipelining and with zero stalls. The maximum delay of the DSP48A in multiplier mode is 6 ns.
The maximum delay for the 2 layers of muxes is 3 ns. The read period from the register file
is 4 ns. Therefore, the maximum delay is 6 + 3 + 4 = 13 ns, which corresponds to a maximum
clock frequency of approximately 77 MHz (1/13 ns).
7.2.2 CPI
Given an input of 7, the Test 1 program requires 70 instructions to complete. Using timing
analysis from the Xilinx ISE, we were able to calculate the number of elapsed clock cycles
required to produce the correct result. The equation below gives the calculated
CPI of 1.35:
\[ \mathrm{CPI}_{\mathrm{Test1}} = \frac{\mathrm{Cycles}}{\mathrm{Instructions}} = \frac{20 + 3 + 3 + 1}{20} = 1.35 \]
Test 2 has 9 instructions per loop. The total penalties are fixed at 7 cycles per loop.
Therefore, the average CPI is approximately 1.78.
\[ \mathrm{CPI}_{\mathrm{Test2}} = \frac{\mathrm{Cycles}}{\mathrm{Instructions}} = \frac{9 + 3 + 3 + 1}{9} \approx 1.78 \]
Test 3 has 5 instructions per loop. The total penalties are fixed at 7 cycles per loop.
Therefore, the average CPI is 2.4.
\[ \mathrm{CPI}_{\mathrm{Test3}} = \frac{\mathrm{Cycles}}{\mathrm{Instructions}} = \frac{5 + 3 + 3 + 1}{5} = 2.4 \]
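The per-loop CPI calculations for the three tests share one form: instructions plus penalty cycles, divided by instructions. A minimal Python sketch of that arithmetic follows; the function name `loop_cpi` is hypothetical, while the loop lengths (20, 9, 5) and the 3 + 3 + 1 penalty breakdown are taken from the equations in this section.

```python
PENALTY_CYCLES = 3 + 3 + 1   # per-loop penalty terms as they appear in the equations


def loop_cpi(instructions, penalty_cycles=PENALTY_CYCLES):
    """Average cycles per instruction for one loop iteration."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return (instructions + penalty_cycles) / instructions


# Loop lengths used in the Test 1, Test 2, and Test 3 equations.
for name, n in (("Test 1", 20), ("Test 2", 9), ("Test 3", 5)):
    print(f"{name}: CPI = {loop_cpi(n):.2f}")
```

Because the penalty is fixed per loop, shorter loops amortize it over fewer instructions, which is why the CPI rises from Test 1 to Test 3.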
7.2.4 Simulations
8 Contributions
9 Conclusion
10 Recommendation
11 References
12 Appendices
Figure 8: Test 1 time 1.