
UNIT IV PROCESSOR

Instruction Execution – Building a Data Path – Designing a Control Unit – Hardwired Control, Microprogrammed Control – Pipelining – Data Hazard – Control Hazards.

The processor fetches one instruction at a time and performs the operation specified.
Instructions are fetched from successive memory locations until a branch or a jump instruction
is encountered.
The processor uses the program counter, PC, to keep track of the address of the next instruction
to be fetched and executed.
After fetching an instruction, the contents of the PC are updated to point to the next instruction
in sequence. A branch instruction may cause a different value to be loaded into the PC.
When an instruction is fetched, it is placed in the instruction register, IR, from where it is
interpreted, or decoded, by the processor’s control circuitry. The IR holds the instruction until
its execution is completed.
Consider a 32-bit computer in which each instruction is contained in one word in the memory,
as in RISC-style instruction set architecture. To execute an instruction, the processor has to
perform the following steps:
1. Fetch the contents of the memory location pointed to by the PC. The contents of this
location are the instruction to be executed; hence they are loaded into the IR. In register
transfer notation, the required action is
IR ← [[PC]]
2. Increment the PC to point to the next instruction. Assuming that the memory is byte
addressable, the PC is incremented by 4; that is
PC ← [PC] + 4
3. Carry out the operation specified by the instruction in the IR.
Fetching an instruction and loading it into the IR is usually referred to as the instruction fetch
phase. Performing the operation specified in the instruction constitutes the instruction
execution phase.
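The fetch phase described above can be sketched in a few lines of Python. This is only an illustration: the word size, memory layout, and instruction strings are invented for the example, not part of any real ISA.

```python
# Minimal sketch of the instruction fetch phase in register-transfer terms:
# IR <- [[PC]] followed by PC <- [PC] + 4.

WORD_SIZE = 4  # byte-addressable memory, one 32-bit word per instruction

def fetch(memory, pc):
    """Return (ir, new_pc): load the word at [PC] into IR, then advance PC."""
    ir = memory[pc]            # contents of the location pointed to by PC
    return ir, pc + WORD_SIZE  # PC now points to the next instruction

memory = {0: "Load R5, X(R7)", 4: "Add R3, R4, R5"}
ir, pc = fetch(memory, 0)
```

After the call, IR holds the fetched instruction and the PC has been incremented by 4, ready for the next sequential fetch.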
With few exceptions, the operation specified by an instruction can be carried out by performing
one or more of the following actions:
• Read the contents of a given memory location and load them into a processor register.
• Read data from one or more processor registers.
• Perform an arithmetic or logic operation and place the result into a processor register.
• Store data from a processor register into a given memory location.
The processor communicates with the memory through the processor-memory interface, which
transfers data from and to the memory during Read and Write operations.
The instruction address generator updates the contents of the PC after every instruction is
fetched.
The register file is a memory unit whose storage locations are organized to form the processor’s
general-purpose registers.

Figure 1: Main hardware components of a processor


During execution, the contents of the registers named in an instruction that performs an
arithmetic or logic operation are sent to the arithmetic and logic unit (ALU), which performs
the required computation. The results of the computation are stored in a register in the register
file.

Data Processing Hardware


A typical computation operates on data stored in registers. These data are processed by
combinational circuits, such as adders, and the results are placed into a register.
A clock signal is used to control the timing of data transfers. The registers comprise edge-triggered flip-flops into which new data are loaded at the active edge of the clock.

Figure 2: Basic structure for data processing.

Figure 3: Hardware structure with multiple stages.


Instruction Execution
Load Instructions
Arithmetic and Logic Instructions
Store Instructions

• Load Instructions
Consider the instruction
Load R5, X(R7)
which uses the Index addressing mode to load a word of data from memory location X + [R7]
into register R5. Execution of this instruction involves the following actions:
• Fetch the instruction from the memory.
• Increment the program counter.
• Decode the instruction to determine the operation to be performed.
• Read register R7.
• Add the immediate value X to the contents of R7.
• Use the sum X + [R7] as the effective address of the source operand, and read the contents
of that location in the memory.
• Load the data received from the memory into the destination register, R5.

Assume that the processor has five hardware stages, which is a commonly used arrangement
in RISC-style processors.
Execution of each instruction is divided into five steps, such that each step is carried out by
one hardware stage. In this case, fetching and executing the Load instruction above can be
completed as follows:
1. Fetch the instruction and increment the program counter.
2. Decode the instruction and read the contents of register R7 in the register file.
3. Compute the effective address.
4. Read the memory source operand.
5. Load the operand into the destination register, R5.
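The five steps for the Load instruction can be sketched as a small Python function. The register numbers, the value of X, and the memory contents below are made-up examples chosen only to illustrate the effective-address calculation.

```python
# Hedged sketch of executing Load R5, X(R7): steps 2-5 of the five-step
# sequence (step 1, fetch and PC increment, is assumed to have happened).

def execute_load(regs, memory, rd, rs, x):
    base = regs[rs]    # step 2: read the base register (R7)
    ea = x + base      # step 3: compute the effective address X + [R7]
    data = memory[ea]  # step 4: read the memory source operand
    regs[rd] = data    # step 5: load the operand into the destination (R5)

regs = {7: 100}        # [R7] = 100
memory = {140: 999}    # the source operand, at address X + [R7] = 40 + 100
execute_load(regs, memory, rd=5, rs=7, x=40)
```

After execution, R5 holds the word read from memory location 140.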
Arithmetic and Logic Instructions
Instructions that involve an arithmetic or logic operation can be executed using similar steps.
They differ from the Load instruction in two ways:
• There are either two source registers, or a source register and an immediate source operand.
• No access to memory operands is required.
A typical instruction of this type is

Add R3, R4, R5


It requires the following steps:
1. Fetch the instruction and increment the program counter.
2. Decode the instruction and read the contents of source registers R4 and R5.
3. Compute the sum [R4] + [R5].
4. Load the result into the destination register, R3.

If the instruction uses an immediate operand, as in


Add R3, R4, #1000
the immediate value is given in the instruction word. Once the instruction is loaded
into the IR, the immediate value is available for use in the addition operation. The same five-
step sequence can be used, with steps 2 and 3 modified as:
2. Decode the instruction and read register R4.
3. Compute the sum [R4] + 1000.
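Both forms of the Add instruction can be captured in one sketch. The helper function and sample register values are ours; the instruction examples are the ones from the text.

```python
# Hedged sketch of the Add execution steps: the second operand is either
# a register or the immediate value carried in the instruction word.

def execute_add(regs, rd, rs1, src2, immediate=False):
    a = regs[rs1]                          # step 2: read first source register
    b = src2 if immediate else regs[src2]  # immediate value or second register
    regs[rd] = a + b                       # step 3: compute sum; then write back

regs = {4: 7, 5: 3}
execute_add(regs, rd=3, rs1=4, src2=5)                     # Add R3, R4, R5
sum_regs = regs[3]
execute_add(regs, rd=3, rs1=4, src2=1000, immediate=True)  # Add R3, R4, #1000
sum_imm = regs[3]
```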

Store Instructions
The five-step sequence used for the Load and Add instructions is also suitable for Store
instructions, except that the final step of loading the result into a destination register is not
required.
The hardware stage responsible for this step takes no action.
For example, the instruction
Store R6, X(R8)
stores the contents of register R6 into memory location
X + [R8]. It can be implemented as follows:
1. Fetch the instruction and increment the program counter.
2. Decode the instruction and read registers R6 and R8.
3. Compute the effective address X + [R8].
4. Store the contents of register R6 into memory location X + [R8].
5. No action.
After reading register R8 in step 2, the memory address is computed in step 3 using the
immediate value, X, in the IR. In step 4, the contents of R6 are sent to the memory to be
stored. No action is taken in step 5.

Step Action
1 Fetch an instruction and increment the program counter.
2 Decode the instruction and read registers from the register file.
3 Perform an ALU operation.
4 Read or write memory data if the instruction involves a memory operand.
5 Write the result into the destination register, if needed.

Table 1: A five-step sequence of actions to fetch and execute an instruction
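Table 1 can be read as a uniform interpreter: every instruction class follows the same five steps, and a stage that is not needed simply takes no action (step 5 for Store, step 4 for Add). The tuple encoding below is an illustrative assumption, not a real instruction format.

```python
# Sketch of Table 1 as a uniform five-step interpreter. Step 1 (fetch) and
# step 2 (decode/register read) are implicit in the tuple unpacking.

def run(instr, regs, memory):
    op = instr[0]
    if op == "Load":                      # Load rd, X(rs)
        _, rd, x, rs = instr
        ea = x + regs[rs]                 # step 3: ALU computes the address
        regs[rd] = memory[ea]             # step 4: read memory; step 5: write back
    elif op == "Store":                   # Store rs2, X(rs1)
        _, rs2, x, rs1 = instr
        ea = x + regs[rs1]                # step 3
        memory[ea] = regs[rs2]            # step 4: write memory; step 5: no action
    elif op == "Add":                     # Add rd, rs1, rs2
        _, rd, rs1, rs2 = instr
        regs[rd] = regs[rs1] + regs[rs2]  # step 3; step 4: no action

regs = {6: 55, 8: 100}
memory = {}
run(("Store", 6, 20, 8), regs, memory)    # Store R6, 20(R8)
```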


Hardware Components
Register File
General-purpose registers are usually implemented in the form of a register file, which is a
small and fast memory block. It consists of an array of storage elements, with access circuitry
that enables data to be read from or written into any register.


Figure 4: Two alternatives for implementing a dual-ported register file


The access circuitry is designed to enable two registers to be read at the same time, making
their contents available at two separate outputs, A and B.
The register file has two address inputs that select the two registers to be read.
These inputs are connected to the fields in the IR that specify the source registers, so that the
required registers can be read.
The register file also has a data input, C, and a corresponding address input to select the register
into which data are to be written.
This address input is connected to the IR field that specifies the destination register of the
instruction.

ALU
The arithmetic and logic unit is used to manipulate data. It performs arithmetic operations such
as addition and subtraction, and logic operations such as AND, OR, and XOR. The register file
and the ALU may be connected as shown in Figure 5.

Figure 5 : Conceptual view of the hardware needed for computation.


When an instruction that performs an arithmetic or logic operation is being executed, the contents of
the two registers specified in the instruction are read from the register file and become available at
outputs A and B.
Output A is connected directly to the first input of the ALU, InA, and output B is connected to a
multiplexer, Mux B.
The multiplexer selects either output B of the register file or the immediate value in the IR to be
connected to the second ALU input.
The output of the ALU is connected to the data input, C, of the register file so that the results of a
computation can be loaded into the destination register.
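The connection between the register file, MuxB, and the ALU can be sketched directly. The function names and the B_select encoding (0 selects register output B, 1 selects the immediate) follow the control-signal description later in this unit; the operand values are invented.

```python
# Sketch of the computation path: MuxB chooses the ALU's second input,
# either register-file output B or the immediate value from the IR.

def alu(op, in_a, in_b):
    ops = {"Add": lambda a, b: a + b, "Sub": lambda a, b: a - b,
           "AND": lambda a, b: a & b, "OR": lambda a, b: a | b}
    return ops[op](in_a, in_b)

def mux_b(b_select, reg_b, immediate):
    return reg_b if b_select == 0 else immediate

out_a, out_b, imm = 7, 3, 1000
result = alu("Add", out_a, mux_b(0, out_b, imm))      # Add R3, R4, R5 style
result_imm = alu("Add", out_a, mux_b(1, out_b, imm))  # Add R3, R4, #1000 style
```

The ALU output would then be routed to data input C of the register file to update the destination register.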

Building a Data Path


Instruction processing consists of two phases: the fetch phase and the execution phase.
It is convenient to divide the processor hardware into two corresponding sections.
One section fetches instructions and the other executes them.
The section that fetches instructions is also responsible for decoding them and for generating
the control signals that cause appropriate actions to take place in the execution section
The execution section reads the data operands specified in an instruction, performs the required
computations, and stores the results
The hardware can be organized into a multi-stage structure, as shown in Figure 6.

Figure 6: A five-stage organization.


Figure 7: Datapath in a processor
Instruction Fetch Section:
The addresses used to access the memory come from the PC when fetching instructions and
from register RZ in the data path when accessing instruction operands.
Multiplexer MuxMA selects one of these two sources to be sent to the processor-memory
interface.
The PC is included in a larger block, the instruction address generator, which updates the
contents of the PC after each instruction is fetched.
The instruction read from the memory is loaded into the IR, where it stays until its execution
is completed and the next instruction is fetched.
The contents of the IR are examined by the control circuitry to generate the signals needed to
control all the processor’s hardware. They are also used by the block labelled Immediate.
A 16-bit immediate value is extended to 32 bits. The extended value is then used either directly
as an operand or to compute the effective address of an operand.
For some instructions, such as those that perform arithmetic operations, the immediate value
is sign-extended; for others, such as logic instructions, it is padded with zeros.
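The two extension rules performed by the Immediate block can be written out explicitly. The function names are ours; the 16-to-32-bit widths follow the text.

```python
# Sketch of the Immediate block: a 16-bit field from the IR is extended
# to 32 bits, sign-extended for arithmetic and zero-padded for logic.

def sign_extend_16(value):
    value &= 0xFFFF                                # keep the low 16 bits
    return value - 0x10000 if value & 0x8000 else value  # replicate the sign bit

def zero_extend_16(value):
    return value & 0xFFFF                          # pad the upper bits with zeros

neg = sign_extend_16(0xFFFF)   # the 16-bit pattern for -1
pos = zero_extend_16(0xFFFF)   # the same bits zero-padded
```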

Figure 8: Instruction fetch section of Figure 6


Figure 9: Instruction address generator
Designing a Control Unit
There are two basic approaches: hardwired control and microprogrammed control.
The operation of the processor’s hardware components is governed by control signals. These
signals determine which multiplexer input is selected, what operation is performed by the ALU,
and so on.
In each clock cycle, the results of the actions that take place in one stage are stored in inter-
stage registers, to be available for use by the next stage in the next clock cycle.
Since data are transferred from one stage to the next in every clock cycle, inter-stage registers
are always enabled. This is the case for registers RA, RB, RZ, RY, RM, and PC-Temp.
The contents of the other registers, namely, the PC, the IR, and the register file, must not be
changed in every clock cycle.
New data are loaded into these registers only when called for in a particular processing step.
They must be enabled only at those times. The role of the multiplexers is to select the data to
be operated on in any given stage. Figure 10, shows the required control signals.
The register file has three 5-bit address inputs, allowing access to 32 general-purpose registers.
Two of these inputs, Address A and Address B, determine which registers are to be read. They
are connected to fields IR31−27 and IR26−22 in the instruction register. The third address
input, Address C, selects the destination register, into which the input data at port C are to be
written. Multiplexer MuxC selects the source of that address.

Figure 10: Control signals for the datapath.


Figure 11: Processor-memory interface and IR control signals.

Multiplexers are controlled by signals that select which input data appear at the multiplexer’s
output. For example, when B_select is equal to 0, MuxB selects the contents of register RB to
be available at input InB of the ALU. Note that two bits are needed to control MuxC and MuxY,
because each multiplexer selects one of three inputs. The operation performed by the ALU is
determined by a k-bit control code, ALU_op, which can specify up to 2k distinct operations,
such as Add, Subtract, AND, OR, and XOR shown in figure 11.
Hardwired Control
Hardwired control is discussed in this section. An instruction is executed in a sequence of
steps, where each step requires one clock cycle. Hence, a step counter may be used to keep
track of the progress of execution.
Thus, the setting of the control signals depends on:
• Contents of the step counter
• Contents of the instruction register
• The result of a computation or a comparison operation
• External input signals, such as interrupt requests
Figure 12: Generation of the control signals.
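Hardwired control amounts to a combinational function of the step counter and the opcode in the IR. The sketch below is a toy decoding table of our own invention; the signal names (MEM_read, B_select, ALU_op) follow this unit's conventions.

```python
# Hedged sketch of hardwired control: the active control signals are a
# pure function of the current step and the opcode held in the IR.

def control_signals(step, opcode):
    if step == 1:  # fetch phase: same signals for every instruction
        return {"MEM_read": 1, "IR_enable": 1, "PC_increment": 1}
    if step == 3 and opcode == "Load":
        return {"ALU_op": "Add", "B_select": 1}  # address = X + [base register]
    if step == 3 and opcode == "Add":
        return {"ALU_op": "Add", "B_select": 0}  # second operand from RB
    return {}  # stages with no action for this opcode

sig = control_signals(3, "Load")
```

A real implementation is a logic circuit rather than a lookup function, but the input/output relationship is the same.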

Microprogrammed Control
Control signals are generated for each execution step based on the instruction in the IR. In
hardwired control, these signals are generated by circuits that interpret the contents of the IR
as well as the timing signals derived from a step counter. Instead of employing such circuits, it
is possible to use a "software" approach, in which the desired setting of the control signals in
each step is determined by a program stored in a special memory. The control program is called
a microprogram to distinguish it from the program being executed by the processor. The
microprogram is stored on the processor chip in a small and fast memory called the
microprogram memory or the control store.
Suppose that n control signals are needed. Let each control signal be represented by a bit in an
n-bit word, which is often referred to as a control word or a microinstruction.
For example, the action of reading an instruction or a data operand from the memory requires
use of the MEM_read and WMFC signals introduced.
Figure 13: Microprogrammed control unit organization.
The address generator uses a microprogram counter, µPC, to keep track of control store
addresses when reading microinstructions from successive locations.
During step 2 in Figure 7, the microinstruction address generator decodes the instruction in the
IR to obtain the starting address of the corresponding microroutine and loads that address into
the µPC. This is the address that will be used in the following clock cycle to read the control
word corresponding to step 3.
As execution proceeds, the microinstruction address generator increments the µPC to read
microinstructions from successive locations in the control store. One bit in the
microinstruction, which we will call End, is used to mark the last microinstruction in a given
microroutine.
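The µPC, control store, and End bit described above can be sketched as a small interpreter. The control words and their signal contents below are invented for illustration; only the mechanism (increment the µPC until End is set) is from the text.

```python
# Sketch of a microprogrammed control unit: the uPC steps through the
# control store, issuing one control word per clock cycle, until it
# reaches a microinstruction whose End bit is set.

control_store = [
    {"signals": {"MEM_read": 1, "WMFC": 1}, "End": 0},  # fetch step
    {"signals": {"RF_read": 1},             "End": 0},  # decode / register read
    {"signals": {"ALU_op": "Add"},          "End": 1},  # last microinstruction
]

def run_microroutine(start):
    upc, issued = start, []
    while True:
        word = control_store[upc]
        issued.append(word["signals"])  # drive the datapath for this step
        if word["End"]:
            return issued               # microroutine complete
        upc += 1                        # uPC advances to the next location

steps = run_microroutine(0)
```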
Microprogrammed control can be viewed as having a control processor within the main
processor.
Microinstructions are fetched and executed much like machine instructions.
Their function is to direct the actions of the main processor’s hardware components, by
indicating which control signals need to be active during each execution step.
Microprogrammed control is simple to implement and provides considerable flexibility in
controlling the execution of machine instructions. But it is slower than hardwired control. Also,
the flexibility it provides is not needed in RISC-style processors. Since the cost of logic
circuitry is no longer a significant factor, hardwired control has become the preferred choice.

PIPELINING
The speed of execution of programs is influenced by many factors. One way to improve
performance is to use faster circuit technology to implement the processor and the main
memory. Another possibility is to arrange the hardware so that more than one operation can be
performed at the same time.
In this way, the number of operations performed per second is increased, even though the time
needed to perform any one operation is not changed.
Pipelining is a particularly effective way of organizing concurrent activity in a computer
system. The basic idea is very simple. It is frequently encountered in manufacturing plants,
where pipelining is commonly known as an assembly-line operation.

Figure 14: Pipelined execution—the ideal case


Consider how the idea of pipelining can be used in a computer. The five-stage processor
organization in Figure 6 and the corresponding data path in Figure 7 allow instructions to be
fetched and executed one at a time. It takes five clock cycles to complete the execution of each
instruction.
The five stages, corresponding to those in Figure 6, are labeled Fetch, Decode, Compute,
Memory, and Write.
Instruction Ij is fetched in the first cycle and moves through the remaining stages in the
following cycles. In the second cycle, instruction Ij+1 is fetched while instruction Ij is in the
Decode stage where its operands are also read from the register file. In the third cycle,
instruction Ij+2 is fetched while instruction Ij+1 is in the Decode stage and instruction Ij is in the
Compute stage where an arithmetic or logic operation is performed on its operands.
Ideally, this overlapping pattern of execution would be possible for all instructions. Although
any one instruction takes five cycles to complete its execution, instructions are completed at
the rate of one per cycle.
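The ideal-case arithmetic is worth making explicit: with a k-stage pipeline and n instructions, the first instruction takes k cycles and each subsequent one completes one cycle later, for a total of k + (n − 1) cycles instead of n × k.

```python
# Ideal pipelined vs. unpipelined completion time for n instructions
# on a k-stage pipeline (no stalls, no hazards).

def pipelined_cycles(n, k=5):
    return k + (n - 1)      # fill the pipe once, then one completion per cycle

def unpipelined_cycles(n, k=5):
    return n * k            # each instruction runs all k steps alone

t_pipe = pipelined_cycles(100)
t_seq = unpipelined_cycles(100)
```

For 100 instructions the pipeline needs 104 cycles against 500 without it, approaching the ideal k-fold speedup as n grows.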

PIPELINE ORGANIZATION
In the first stage of the pipeline, the program counter (PC) is used to fetch a new instruction.
As other instructions are fetched, execution proceeds through successive stages. At any given
time, each stage of the pipeline is processing a different instruction. Information such as register
addresses, immediate data, and the operations to be performed must be carried through the
pipeline as each instruction proceeds from one stage to the next. This information is held in
interstage buffers

Figure 15: A five-stage pipeline


PIPELINE ISSUES:
• Data Dependency
Consider the two instructions in Figure 16:
Add R2, R3, #100
Subtract R9, R2, #30
The destination register R2 for the Add instruction is a source register for the Subtract
instruction.
There is a data dependency between these two instructions, because register R2 carries
data from the first instruction to the second.
Pipelined execution of these two instructions is depicted in Figure 16.
The Subtract instruction is stalled for three cycles to delay reading register R2 until
cycle 6 when the new value becomes available.

Figure 16: Pipeline stall due to data dependency.

• Operand Forwarding
Pipeline stalls due to data dependencies can be alleviated through the use of operand
forwarding. Consider the pair of instructions discussed above, where the pipeline is
stalled for three cycles to enable the Subtract instruction to use the new value in register
R2.

Figure 17: Avoiding a stall by using operand forwarding.

The desired value is actually available at the end of cycle 3, when the ALU completes
the operation for the Add instruction.
Rather than stall the Subtract instruction, the hardware can forward the value from
register RZ.
Forwarding can also be extended to a result in register RY in Figure 17. This would
handle a data dependency such as the one involving register R2 in the following
sequence of instructions:
Add R2, R3, #100
Or R4, R5, R6
Subtract R9, R2, #30
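The forwarding decision for the three-instruction sequence above can be sketched as a selection function: the Subtract's R2 operand is taken from register RZ, from RY, or from the register file, depending on where the producing instruction currently sits. The tuple encoding (op, rd, sources...) is an illustrative assumption.

```python
# Sketch of operand forwarding: choose the source of a register operand
# based on which in-flight instruction, if any, writes that register.

def forward_source(src_reg, in_compute, in_memory):
    if in_compute and in_compute[1] == src_reg:
        return "RZ"              # result just produced by the ALU
    if in_memory and in_memory[1] == src_reg:
        return "RY"              # result one stage further along
    return "register file"       # no dependence on an in-flight instruction

add_i = ("Add", 2, 3, 100)       # Add R2, R3, #100
or_i  = ("Or", 4, 5, 6)          # Or R4, R5, R6
# When Subtract R9, R2, #30 is in Decode, Or is in Compute and Add is in Memory:
src = forward_source(2, in_compute=or_i, in_memory=add_i)
```

The Or instruction does not write R2, so the new value is forwarded from RY, exactly the extended-forwarding case the text describes.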

PIPELINE HAZARDS
There are situations in pipelining when the next instruction cannot execute in the following
clock cycle. These events are called hazards, and there are three different types.
Structural Hazards
The first hazard is called a structural hazard. It means that the hardware cannot support the
combination of instructions that we want to execute in the same clock cycle.
Data Hazards
Data hazards occur when the pipeline must be stalled because one step must wait for another
to complete. In a computer pipeline, data hazards arise from the dependence of one instruction
on an earlier one that is still in the pipeline.
For example, suppose we have an add instruction followed immediately by a subtract
instruction that uses the sum ($s0):
add $s0, $t0, $t1
sub $t2, $s0, $t3

Without intervention, a data hazard could severely stall the pipeline. The add instruction
doesn’t write its result until the fifth stage, meaning that we would have to waste three clock
cycles in the pipeline. Although we could try to rely on compilers to remove all such hazards,
the results would not be satisfactory. These dependences happen just too often and the delay is
just too long to expect the compiler to rescue us from this dilemma.
1. IF: Instruction fetch
2. ID: Instruction decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back

Figure 18: Instructions being executed using the single-cycle datapath


There are, however, two exceptions to this left-to-right flow of instructions:
■ The write-back stage, which places the result back into the register file in the middle of the
data path
■ The selection of the next value of the PC, choosing between the incremented PC and the
branch address from the MEM stage.
Data flowing from right to left does not affect the current instruction; only later instructions in
the pipeline are influenced by these reverse data movements. Note that the first right-to-left
flow of data can lead to data hazards and the second leads to control hazards.
Figure 19: A pipelined sequence of instructions

A notation that names the fields of the pipeline registers allows for a more precise notation of
dependences.
For example, “ID/EX.RegisterRs” refers to the number of one register whose value is found in
the pipeline register ID/EX; that is, the one from the first read port of the register file.
The first part of the name, to the left of the period, is the name of the pipeline register; the
second part is the name of the field in that register. Using this notation, the two pairs of hazard
conditions are
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
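The two pairs of hazard conditions translate directly into predicates over the pipeline-register fields, using the same ID/EX, EX/MEM, MEM/WB notation. The dictionary representation and the sample register numbers are our illustration ($s0 is register 16 and $t3 is register 11 in the MIPS convention).

```python
# The four hazard conditions as predicates: an EX hazard covers cases
# 1a/1b, a MEM hazard covers cases 2a/2b.

def ex_hazard(ex_mem, id_ex):
    return ex_mem["RegisterRd"] in (id_ex["RegisterRs"], id_ex["RegisterRt"])

def mem_hazard(mem_wb, id_ex):
    return mem_wb["RegisterRd"] in (id_ex["RegisterRs"], id_ex["RegisterRt"])

# add $s0, $t0, $t1 followed by sub $t2, $s0, $t3:
id_ex = {"RegisterRs": 16, "RegisterRt": 11}  # sub reads $s0 and $t3
ex_mem = {"RegisterRd": 16}                   # add will write $s0
hazard = ex_hazard(ex_mem, id_ex)             # condition 1a holds
```

In hardware these comparisons drive the forwarding multiplexers rather than returning booleans, but the detection logic is the same.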
Control Hazards
The third type of hazard is called a control hazard, arising from the need to make a decision
based on the results of one instruction while others are executing.
1. Instruction fetch: The instruction is read from memory using the address in the PC and then
is placed in the IF/ID pipeline register. This stage occurs before the instruction is identified, so
the top portion of Figure 20 works for store as well as load.
2. Instruction decode and register file read: The instruction in the IF/ID pipeline register
supplies the register numbers for reading two registers and extends the sign of the 16-bit
immediate. These three 32-bit values are all stored in the ID/EX pipeline register. The bottom
portion of Figure 20 for load instructions also shows the operations of the second stage for
stores. These first two stages are executed by all instructions, since it is too early to know the
type of the instruction.
3. Execute and address calculation: Figure 20 shows the third step; the effective address is
placed in the EX/MEM pipeline register.
4. Memory access: The top portion of Figure 20 shows the data being written to memory. Note
that the register containing the data to be stored was read in an earlier stage and stored in ID/EX.
The only way to make the data available during the MEM stage is to place the data into the
EX/MEM pipeline register in the EX-stage, just as we stored the effective address into
EX/MEM.
5. Write-back: The bottom portion of Figure 20 shows the final step of the store. For this
instruction, nothing happens in the write-back stage. Since every instruction behind the store
is already in progress, we have no way to accelerate those instructions. Hence, an instruction
passes through a stage even if there is nothing to do, because later instructions are already
progressing at the maximum rate.
Figure 20: The pipelined version of the datapath
Figure 21: On the top are the ALU and pipeline registers before adding forwarding.

Branch Delays
In ideal pipelined execution a new instruction is fetched every cycle, while the preceding
instruction is still being decoded. Branch instructions can alter the sequence of execution, but
they must first be executed to determine whether and where to branch.

Conditional Branches
Consider a conditional branch instruction such as
Branch_if_[R5]=[R6] LOOP
The execution steps for this instruction are shown in Figure 22. The result of the comparison
in the third step determines whether the branch is taken. For pipelining, the branch condition
must be tested as early as possible to limit the branch penalty. We have just described how the
target address for an unconditional branch instruction can be determined in the Decode stage.
Similarly, the comparator that tests the branch condition can also be moved to the Decode stage,
enabling the conditional branch decision to be made at the same time that the target address is
determined. In this case, the comparator uses the values from outputs A and B of the register
file directly. Moving the branch decision to the Decode stage ensures a common branch penalty
of only one cycle for all branch instructions. In the next two sections, we discuss additional
techniques that can be used to further mitigate the effect of branches on execution time.
Unconditional Branches
Figure 22 shows the pipelined execution of a sequence of instructions, beginning with an
unconditional branch instruction, Ij.
The next two instructions, Ij+1 and Ij+2, are stored in successive memory addresses following
Ij. The target of the branch is instruction Ik. According to Figure 22, the branch instruction is
fetched in cycle 1 and decoded in cycle 2, and the target address is computed in cycle 3. Hence,
instruction Ik is fetched in cycle 4, after the program counter has been updated with the target
address. In pipelined execution, instructions Ij+1 and Ij+2 are fetched in cycles 2 and 3,
respectively, before the branch instruction is decoded and its target address is known. They
must be discarded.
The resulting two-cycle delay constitutes a branch penalty. Branch instructions occur
frequently. In fact, they represent about 20 percent of the dynamic instruction count of most
programs. (The dynamic count is the number of instruction executions, taking into account the
fact that some instructions in a program are executed many times, because of loops.) With a
two-cycle branch penalty, the relatively high frequency of branch instructions could increase
the execution time for a program by as much as 40 percent. Therefore, it is important to find
ways to mitigate this impact on performance.
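The 40 percent figure is simple arithmetic: if a fraction f of executed instructions are branches and each incurs p penalty cycles, the average cycles per instruction grow from 1 to 1 + f·p.

```python
# Branch penalty arithmetic: fractional increase in execution time from
# a branch frequency f and a per-branch penalty of p cycles.

def slowdown(branch_fraction, penalty_cycles):
    return branch_fraction * penalty_cycles

increase_2cycle = slowdown(0.20, 2)  # two-cycle penalty (Compute stage)
increase_1cycle = slowdown(0.20, 1)  # one-cycle penalty (Decode stage)
```

With branches at 20 percent of the dynamic count, a two-cycle penalty adds up to 40 percent to the execution time; resolving branches in the Decode stage halves that to 20 percent.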
Reducing the branch penalty requires the branch target address to be computed earlier in the
pipeline. Rather than wait until the Compute stage, it is possible to determine the target address
and update the program counter in the Decode stage.
Thus, instruction Ik can be fetched one clock cycle earlier, reducing the branch penalty to one
cycle, as shown in Figure23. This time, only one instruction, Ij+1, is fetched incorrectly,
because the target address is determined in the Decode stage.

Figure 22: Branch penalty when the target address is determined in the Compute stage of the
pipeline.
Figure 23: Branch penalty when the target address is determined in the Decode stage of the
pipeline.
Branch Prediction
To reduce the branch penalty further, the processor needs to anticipate that an instruction
being fetched is a branch instruction and predict its outcome to determine which instruction
should be fetched in cycle 2.
Static Branch Prediction
The simplest form of branch prediction is to assume that the branch will not be taken and to
fetch the next instruction in sequential address order. If the prediction is correct, the fetched
instruction is allowed to complete and there is no penalty. However, if it is determined that the
branch is to be taken, the instruction that has been fetched is discarded and the correct branch
target instruction is fetched. Misprediction incurs the full branch penalty. This simple approach
is a form of static branch prediction.
Dynamic Branch Prediction

To improve prediction accuracy further, we can use actual branch behaviour to influence the
prediction, resulting in dynamic branch prediction. The processor hardware assesses the
likelihood of a given branch being taken by keeping track of branch decisions every time that
a branch instruction is executed. In its simplest form, a dynamic prediction algorithm can use
the result of the most recent execution of a branch instruction. The processor assumes that the
next time the instruction is executed, the branch decision is likely to be the same as the last
time. Hence, the algorithm may be described by the two-state machine in Figure 24.
The two states are:
LT – Branch is likely to be taken
LNT – Branch is likely not to be taken
Figure 24: State-machine representation of branch prediction algorithms
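The two-state machine of Figure 24 fits in a few lines: predict the same outcome as the most recent execution, and switch state whenever the actual outcome differs. The function names are ours.

```python
# Two-state dynamic branch predictor: LT (likely taken) predicts taken,
# LNT (likely not taken) predicts not taken; the state simply tracks the
# most recent actual outcome of the branch.

def predict(state):
    return state == "LT"

def update(state, taken):
    return "LT" if taken else "LNT"

state = "LNT"
p1 = predict(state)           # first prediction: not taken
state = update(state, True)   # the branch was actually taken -> move to LT
p2 = predict(state)           # next prediction: taken
```

A loop-closing branch illustrates the weakness of this scheme: it mispredicts twice per loop execution, once on entry and once on exit, which motivates the four-state (2-bit) predictors used in practice.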

You might also like