COAL Assignment (Y86 Processor Architecture)
Section: C
Processor Architecture:
The word “architecture” typically refers to building design and construction. In computing, it
refers to the design of hardware, and the most important example is the architecture of the
processor. The design of the processor determines what software can run on the computer and
what other hardware components are supported. For example, Intel's x86 processor architecture
is the standard architecture used by most PCs.
Registers:
Registers are high-performance storage locations used to manipulate data. Y86 has eight program
registers.
Program registers
Y86 programs can read and modify the eight program registers %eax, %ecx, %edx, %ebx, %esi,
%edi, %esp, and %ebp. Each stores one 32-bit word. Register %esp is used as the stack pointer by
the push, pop, call, and return instructions. Otherwise, the registers have no fixed meaning or
values.
Condition code:
There are three single-bit condition codes, ZF, SF, and OF, storing information about the effect
of the most recent arithmetic or logical instruction.
Status registers:
The status code indicates whether the program is running normally or some special event has
occurred.
Instruction encoding:
Some instructions are just 1 byte long, but those that require operands have longer
encodings. First, there can be an additional register specifier byte, specifying either one
or two registers. These register fields are called rA and rB.
Instructions that have no register operands, such as branches and call, do not have a
register specifier byte.
Those that require just one register operand (irmovl, pushl, and popl) have the other
register specifier set to value 0xF.
Some instructions require an additional 4-byte constant word. This word can serve as the
immediate data for irmovl, the displacement for rmmovl and mrmovl address specifiers,
and the destination of branches and calls.
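The encoding rules above determine how long each instruction is. The following is a minimal sketch in Python, not part of the Y86 definition itself. The names ICODES, NEED_REGIDS, NEED_VALC, and instruction_length are hypothetical, and the codes for rmmovl (4) and mrmovl (5) are taken from the standard Y86 encoding because they are not listed in this document.

```python
# Instruction codes (high nibble of byte 0). rmmovl = 4 and mrmovl = 5 are
# assumed from the standard Y86 encoding; the others appear in this document.
ICODES = {
    "halt": 0x0, "nop": 0x1, "rrmovl": 0x2, "irmovl": 0x3,
    "rmmovl": 0x4, "mrmovl": 0x5, "OPl": 0x6, "jXX": 0x7,
    "call": 0x8, "ret": 0x9, "pushl": 0xA, "popl": 0xB,
}

# Instructions that carry a register specifier byte (rA:rB).
NEED_REGIDS = {"rrmovl", "irmovl", "rmmovl", "mrmovl", "OPl", "pushl", "popl"}

# Instructions that carry a 4-byte constant word.
NEED_VALC = {"irmovl", "rmmovl", "mrmovl", "jXX", "call"}


def instruction_length(name: str) -> int:
    """Return the encoded length in bytes of the named instruction."""
    if name not in ICODES:
        raise ValueError(f"unknown instruction: {name}")
    length = 1                      # byte 0: icode and function
    if name in NEED_REGIDS:
        length += 1                 # register specifier byte
    if name in NEED_VALC:
        length += 4                 # constant word
    return length
```

For example, instruction_length("irmovl") counts the code byte, the register byte, and the 4-byte constant, giving 6 bytes, while branches and call skip the register byte and take 5 bytes.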
1) Simple instructions:
halt
The halt instruction stops instruction execution. For Y86, executing the halt
instruction causes the processor to stop, with the status code set to HLT.
Byte 0 1 2 3 4 5 6
halt
0 0
nop
This program instruction does nothing.
Byte 0 1 2 3 4 5 6
nop
1 0
2) Move instructions:
The movl instruction is split into four different instructions: irmovl, rrmovl, mrmovl, and
rmmovl. The source is either immediate (i), register (r), or memory (m) and is designated by the
first character of the instruction name. The destination is either register (r) or memory (m)
and is designated by the second character of the instruction name.
irmovl: (immediate (i) to register (r) move instruction)
This instruction is an immediate-to-register move instruction. The l in the instruction name
indicates that we are moving a double word (4 bytes).
Byte 0 1 2 3 4 5 6
irmovl V, rB
3 0 F rB V
Immediate refers to the constant value V that is encoded in the instruction itself.
The instruction performs R[rB] <- V, that is, it moves the constant into register rB. The
rA field is unused and is set to 0xF. For example, to move a constant into register %ebx
(register ID 3), the second byte of the instruction is 0xF3, followed by the 4-byte constant.
rrmovl: (register (r) to register (r) move instruction)
This instruction is a register-to-register move instruction. The l in the instruction
name indicates that we are moving a double word.
Byte 0 1 2 3 4 5 6
rrmovl rA, rB 2 0 rA rB
The instruction performs R[rB] <- R[rA]: register rA is the source and rB is the destination.
Y86 register IDs range from 0 to 7 (%eax = 0, %ecx = 1, %edx = 2, %ebx = 3, %esp = 4,
%ebp = 5, %esi = 6, %edi = 7). Suppose rA is %edx (ID 2) and rB is %ebx (ID 3).
rrmovl %edx, %ebx
2 0 2 3
Encoding:
The instruction occupies two bytes in memory. Byte 0 holds the instruction code and function
(0x20) and byte 1 holds the register specifiers (0x23), so the bytes appear in memory in the
order 20 23. If those two bytes are read as a 16-bit little-endian value, they read as
0x2320.
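As a rough illustration of the byte layouts for rrmovl and irmovl, the sketch below builds the encoded bytes in Python. The function names encode_rrmovl and encode_irmovl and the REG_IDS table are hypothetical helpers. The register IDs follow the standard Y86 numbering (0 to 7), and 0xF marks an unused register field.

```python
# Standard Y86 register identifiers; 0xF means "no register".
REG_IDS = {"%eax": 0, "%ecx": 1, "%edx": 2, "%ebx": 3,
           "%esp": 4, "%ebp": 5, "%esi": 6, "%edi": 7}
RNONE = 0xF


def encode_rrmovl(ra: str, rb: str) -> bytes:
    """Encode rrmovl rA, rB as two bytes: 0x20, then rA:rB."""
    return bytes([0x20, (REG_IDS[ra] << 4) | REG_IDS[rb]])


def encode_irmovl(value: int, rb: str) -> bytes:
    """Encode irmovl V, rB as 0x30, F:rB, then V as 4 little-endian bytes."""
    reg_byte = (RNONE << 4) | REG_IDS[rb]
    constant = (value & 0xFFFFFFFF).to_bytes(4, "little")
    return bytes([0x30, reg_byte]) + constant


print(encode_rrmovl("%edx", "%ebx").hex())   # 2023
print(encode_irmovl(0x100, "%ebx").hex())    # 30f300010000
```

The constant word is stored least significant byte first, which is why 0x100 appears as 00 01 00 00 in the encoded instruction.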
mrmovl: (memory (m) to register (r) move instruction)
This instruction is a memory-to-register move instruction. The l in the instruction
name indicates that we are moving a double word.
Byte 0 1 2 3 4 5 6
mrmovl D(rB), rA 5 0 rA rB D
The value read from memory is placed in the register file. To form the memory address, Y86
uses a mode called register + offset: we read a value out of register rB and add an offset
(the displacement D) to it. The sum is the memory location used to access the data.
For example, with the instruction mrmovl 4(%rB), %rA and R[%rB] = 0x4000, the address is
0x4000 + 4 = 0x4004, and the 4-byte word stored at 0x4004 is loaded into %rA.
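The register + offset mode can be sketched in a few lines of Python. This is only an illustration under simple assumptions: registers are a dict keyed by name, memory is a dict from byte address to byte value, and the function name mrmovl and the sample memory contents are hypothetical.

```python
def mrmovl(regs: dict, memory: dict, displacement: int, rb: str, ra: str) -> int:
    """Load the 4-byte word at R[rB] + displacement into rA; return the address."""
    address = (regs[rb] + displacement) & 0xFFFFFFFF
    # Memory is byte addressed; unwritten bytes read as 0 in this sketch.
    word = bytes(memory.get(address + i, 0) for i in range(4))
    regs[ra] = int.from_bytes(word, "little")
    return address


# Hypothetical state: the base register holds 0x4000, as in the example above.
regs = {"%ebx": 0x4000, "%eax": 0}
memory = {0x4004: 0x78, 0x4005: 0x56, 0x4006: 0x34, 0x4007: 0x12}
addr = mrmovl(regs, memory, 4, "%ebx", "%eax")
print(hex(addr), hex(regs["%eax"]))   # 0x4004 0x12345678
```

The base register plus the displacement of 4 gives address 0x4004, and the four bytes stored there are assembled in little-endian order into the destination register.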
3) Stack operations:
pushl:
The pushl and popl instructions implement push and pop, just as they do in IA32. The
pushl rA instruction first decrements the stack pointer by 4 (the size of a double word)
to create space for the new value, and then stores the value of rA at the memory location
that the stack pointer now references.
Byte 0 1 2 3 4 5 6
pushl rA
A 0 rA F
Encoding:
pushl %eax (rA = 0) A 0 0 F
popl:
The popl instruction reverses the steps of pushl. It first reads the word at the memory
location referenced by the stack pointer and places it in rA. It then increments the stack
pointer by 4.
Byte 0 1 2 3 4 5 6
popl rA
B 0 rA F
Encoding:
popl %eax (rA = 0) B 0 0 F
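The order of the stack pointer update and the memory access can be made concrete with a small Python sketch. It assumes registers in a dict and a byte-addressed memory dict. The function names and the sample values are hypothetical.

```python
def pushl(regs: dict, memory: dict, ra: str) -> None:
    """Decrement %esp by 4, then store R[rA] at the new top of stack."""
    value = regs[ra]                       # read before %esp changes
    regs["%esp"] = (regs["%esp"] - 4) & 0xFFFFFFFF
    for i, byte in enumerate(value.to_bytes(4, "little")):
        memory[regs["%esp"] + i] = byte


def popl(regs: dict, memory: dict, ra: str) -> None:
    """Read the word at %esp into rA, then increment %esp by 4."""
    address = regs["%esp"]
    word = bytes(memory[address + i] for i in range(4))
    regs["%esp"] = (address + 4) & 0xFFFFFFFF
    regs[ra] = int.from_bytes(word, "little")


regs = {"%esp": 0x100, "%eax": 0xABCD, "%ecx": 0}
memory = {}
pushl(regs, memory, "%eax")
esp_after_push = regs["%esp"]
popl(regs, memory, "%ecx")
print(hex(esp_after_push), hex(regs["%ecx"]), hex(regs["%esp"]))
```

After the push the stack pointer is 4 lower than before, and the matching pop restores it while delivering the saved value to the destination register.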
4) Integer operations:
Byte 0 1 2 3 4 5 6
OPl rA, rB
6 fn rA rB
5) Jump instructions:
The seven jump instructions are jmp, jle, jl, je, jne, jge, and jg. Branches are taken according
to the type of branch and the settings of the condition codes. The branch conditions are the same
as with IA32.
Byte 0 1 2 3 4 5 6
jXX Dest 7 fn Dest
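How the condition codes decide a branch can be sketched as follows. The function codes (jmp = 0, jle = 1, jl = 2, je = 3, jne = 4, jge = 5, jg = 6) are assumed from the standard Y86 encoding, since this document only shows the fn field symbolically, and branch_taken is a hypothetical helper name.

```python
def branch_taken(ifun: int, zf: bool, sf: bool, of: bool) -> bool:
    """Decide whether jXX is taken, using the IA32-style signed conditions."""
    less = sf != of                      # SF xor OF: signed "less than"
    conditions = {
        0: True,                         # jmp: always taken
        1: less or zf,                   # jle
        2: less,                         # jl
        3: zf,                           # je
        4: not zf,                       # jne
        5: not less,                     # jge
        6: (not less) and (not zf),      # jg
    }
    return conditions[ifun]


print(branch_taken(3, zf=True, sf=False, of=False))   # True (je after a zero result)
print(branch_taken(6, zf=False, sf=True, of=False))   # False (jg after a negative result)
```

The same table applies to the conditional move instructions, which use the condition to decide whether the destination register is written.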
Byte 0 1 2 3 4 5 6
cmovXX rA, rB
2 fn rA rB
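call
The call instruction pushes the return address onto the stack and jumps to the destination
address.
call Dest
8 0 Dest
ret
The ret instruction returns from such a call.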
Byte 0 1 2 3 4 5 6
ret
9 0
RISC stands for Reduced Instruction Set Computer. It is a type of microprocessor
architecture that uses a small, highly optimized set of instructions rather than the more
specialized set of instructions often found in other types of architectures, such as CISC
(Complex Instruction Set Computer).
CISC                                          RISC
1) Emphasizes efficiency in instructions      1) Emphasizes efficiency in cycles per
   per program.                                  instruction.
2) Emphasizes smaller code size and uses      2) Needs more RAM.
   less RAM overall than RISC.
3) Variable-length encodings. IA32            3) Fixed-length encodings. Typically all
   instructions can range from 1 to 15           instructions are encoded as 4 bytes.
   bytes.
4) Arithmetic and logical operations can      4) Arithmetic and logical operations only
   be applied to both memory and register        use register operands. Memory is
   operands.                                     referenced only by load and store
                                                 instructions. This convention is
                                                 referred to as a load/store
                                                 architecture.
5) The original microprocessor ISA.           5) A redesigned ISA that emerged in the
                                                 early 1980s.
6) Instructions can take several clock        6) Single-cycle instructions.
   cycles.
7) Hardware-centric design. The ISA does      7) Software-centric design. High-level
   as much as possible using hardware            compilers take most of the burden of
   circuitry.                                    coding many software steps from the
                                                 programmer.
8) Condition codes. Special flags are set     8) No condition codes. Instead, explicit
   as a side effect of instructions and          test instructions store the test
   then used for conditional branch              results in normal registers for use in
   testing.                                      conditional evaluation.
We describe a processor called SEQ (for “sequential” processor). On each clock cycle, SEQ
performs all the steps required to process a complete instruction. This would require a very long
cycle time, however, and so the clock rate would be unacceptably low. Our purpose in
developing SEQ is to provide a first step toward our ultimate goal of implementing an efficient,
pipelined processor.
SEQ Hardware Structure:
The computations required to implement all of the Y86 instructions can be organized as a
series of six basic stages: fetch, decode, execute, memory, write back, and PC update.
The hardware units are associated with the different processing stages:
Fetch
Using the program counter register as an address, the instruction memory reads the bytes
of an instruction. The PC incrementer computes valP, the incremented program counter.
Decode
The register file has two read ports, A and B, via which register values valA and valB are
read simultaneously.
Execute
The execute stage uses the arithmetic/logic (ALU) unit for different purposes according
to the instruction type. For integer operations, it performs the specified operation. For
other instructions, it serves as an adder to compute an incremented or decremented stack
pointer, to compute an effective address, or simply to pass one of its inputs to its outputs
by adding zero. The condition code register (CC) holds the three condition-code bits.
New values for the condition codes are computed by the ALU. When executing a jump
instruction, the branch signal Cnd is computed based on the condition codes and the jump
type.
Memory:
The data memory reads or writes a word of memory when executing a memory
instruction. The instruction and data memories access the same memory locations, but for
different purposes.
Write back
The register file has two write ports. Port E is used to write values computed by the ALU,
while port M is used to write values read from the data memory.
Diagram:
Fetch
Using the program counter register as an address, the instruction memory reads the bytes
of an instruction. The PC incrementer computes valP, the incremented program counter.
For the fetch stage our goal is to read the instruction from memory.
The parts of processor involved in the fetch stage:
(1) The program counter
(2) Memory
(3) PC increment
(4) Logic to identify invalid instruction
Explanation:
We have a memory and a program counter. We use the value in the program counter
as an address to read from memory, and we load what we read into an internal
register that holds the instruction.
Diagram:
Rounded boxes represent logic and oval shapes represent internal registers.
Instead of reading the instruction into one big register, we divide the big
register into smaller registers, because the op code (icode) is crucial for
determining which instruction we are executing. Many instructions have register
fields rA and rB, so when we read from memory we also populate these registers.
The op code tells us how many bytes the instruction occupies, which lets us
increment the PC and produce the potential next PC value, called valP.
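The splitting of the fetched bytes into fields can be sketched as a Python function. This is a simplified model rather than the actual hardware description: fetch is a hypothetical name, memory is a bytes object, and the sets of instruction codes that need a register byte or a constant word are assumed from the standard Y86 encoding.

```python
def fetch(memory: bytes, pc: int) -> dict:
    """Split the instruction at pc into icode, ifun, rA, rB, valC and valP."""
    icode, ifun = memory[pc] >> 4, memory[pc] & 0xF
    valid = icode <= 0xB                      # codes 0x0 to 0xB are defined
    need_regids = icode in (0x2, 0x3, 0x4, 0x5, 0x6, 0xA, 0xB)
    need_valc = icode in (0x3, 0x4, 0x5, 0x7, 0x8)

    position = pc + 1
    ra = rb = 0xF                             # 0xF means "no register"
    if need_regids:
        ra, rb = memory[position] >> 4, memory[position] & 0xF
        position += 1
    valc = None
    if need_valc:
        valc = int.from_bytes(memory[position:position + 4], "little")
        position += 4

    # position now points just past the instruction: this is valP.
    return {"icode": icode, "ifun": ifun, "rA": ra, "rB": rb,
            "valC": valc, "valP": position, "valid": valid}


fields = fetch(bytes.fromhex("30f300010000"), 0)   # irmovl $0x100, %ebx
print(fields)
```

Because valP depends only on the instruction code, the PC incrementer can compute it without waiting for any later stage.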
Decode
This phase determines what to read from the register file and then reads those values. The
register file has two read ports, A and B, via which register values valA and valB are read
simultaneously. The parts of the processor involved in the decode stage are:
(1) rA, rB and icode from the instruction
(2) Register file
(3) Logic to determine which registers are used to produce valA and valB
Explanation:
Normally, we take rA and rB and pass them into logic that tells the register
file which registers to read. Reading those registers produces the values that
we call valA and valB, and we name the signals on the inputs and outputs of
this logic. This implementation works for move, jump, and ALU instructions,
but not for push/pop and call/ret instructions.
Push, pop, call, and return use an implied register, %esp, which is the stack
pointer and is held in the register file, so sometimes we need to read that
register instead. For this reason, in addition to sending rA and rB into the
logic blocks, we also send the op code.
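The register selection logic described above might look like the following Python sketch. The names src_a and src_b are hypothetical, and the exact sets of instruction codes are assumed from the standard Y86 sequential design, so treat this as one plausible reading rather than the definitive logic.

```python
IRRMOVL, IIRMOVL, IRMMOVL, IMRMOVL, IOPL = 0x2, 0x3, 0x4, 0x5, 0x6
IJXX, ICALL, IRET, IPUSHL, IPOPL = 0x7, 0x8, 0x9, 0xA, 0xB
RESP, RNONE = 4, 0xF          # register ID of %esp, and "no register"


def src_a(icode: int, ra: int) -> int:
    """Register ID sent to read port A."""
    if icode in (IRRMOVL, IRMMOVL, IOPL, IPUSHL):
        return ra
    if icode in (IPOPL, IRET):
        return RESP           # the stack pointer is implied
    return RNONE


def src_b(icode: int, rb: int) -> int:
    """Register ID sent to read port B."""
    if icode in (IOPL, IRMMOVL, IMRMOVL):
        return rb
    if icode in (IPUSHL, IPOPL, ICALL, IRET):
        return RESP           # the stack pointer is implied
    return RNONE


print(src_a(IPOPL, 0), src_b(IPUSHL, RNONE))   # 4 4
```

The op code is what lets the logic override the rA and rB fields with %esp for the stack instructions.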
Diagrams:
Execute
This stage uses the ALU and sets the condition codes. The parts of the processor
involved in this stage are:
(1) valA, valB from the register file
(2) valC from the instruction
(3) ALU
(4) Condition Codes
(5) Logic to a) Select inputs in ALU, b) Set Condition codes
Explanation:
The execute stage uses the arithmetic/logic (ALU) unit for different purposes
according to the instruction type. For integer operations, it performs the
specified operation. For other instructions, it serves as an adder to compute an
incremented or decremented stack pointer, to compute an effective address, or
simply to pass one of its inputs to its outputs by adding zero. The condition
code register (CC) holds the three condition-code bits. New values for the
condition codes are computed by the ALU. When executing a jump
instruction, the branch signal Cnd is computed based on the condition codes
and the jump type.
Diagram:
Push and Pop Instructions:
Push, pop, call, and ret instructions also need to increment or decrement the
stack pointer. For some other instructions, in particular the register-to-register
move and irmovl, we use the ALU to add 0. This reduces the wiring and allows
valA to link directly to the register file for a larger class of instructions.
When we increment or decrement the stack pointer, we have to decide where to
feed the plus or minus 4: the stack pointer comes out of the register file as
valB and goes into ALU B, and the plus or minus 4 goes into ALU A. For a
register-to-register move, on the other hand, the register we read is valA,
which goes into ALU A, while ALU B gets 0; for irmovl, the constant valC goes
into ALU A instead. Because ALU B normally receives valB, we need an input
that decides which value to pick, so we extend the icode signal to also feed
the ALU input logic. Finally, we set the condition codes only when the ALU
performs an integer operation (OPl), so we add a logic block that controls
when the condition codes are updated.
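The choice of ALU inputs can be summarized in a short Python sketch. The names alu_a, alu_b, set_cc, and execute_add are hypothetical, the instruction-code sets are assumed from the standard Y86 sequential design, and execute_add models only addition (for OPl the real ALU function depends on ifun).

```python
IRRMOVL, IIRMOVL, IRMMOVL, IMRMOVL, IOPL = 0x2, 0x3, 0x4, 0x5, 0x6
ICALL, IRET, IPUSHL, IPOPL = 0x8, 0x9, 0xA, 0xB
MASK = 0xFFFFFFFF


def alu_a(icode: int, val_a: int, val_c: int) -> int:
    """Select ALU input A from valA, valC, or the constants -4 and +4."""
    if icode in (IRRMOVL, IOPL):
        return val_a
    if icode in (IIRMOVL, IRMMOVL, IMRMOVL):
        return val_c
    if icode in (ICALL, IPUSHL):
        return -4                 # stack grows toward lower addresses
    if icode in (IRET, IPOPL):
        return 4
    return 0                      # other instructions ignore the ALU result


def alu_b(icode: int, val_b: int) -> int:
    """Select ALU input B: valB, or 0 for the moves that just pass A through."""
    if icode in (IRMMOVL, IMRMOVL, IOPL, ICALL, IPUSHL, IRET, IPOPL):
        return val_b
    return 0


def set_cc(icode: int) -> bool:
    """Condition codes are updated only by integer operations."""
    return icode == IOPL


def execute_add(icode: int, val_a: int, val_b: int, val_c: int) -> int:
    """valE when the ALU adds its two selected inputs (32-bit wraparound)."""
    return (alu_a(icode, val_a, val_c) + alu_b(icode, val_b)) & MASK


print(hex(execute_add(IPUSHL, 7, 0x100, 0)))   # 0xfc: decremented stack pointer
```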
Diagram:
Memory
The data memory reads or writes a word of memory when executing a memory
instruction. The instruction and data memories access the same memory locations, but for
different purposes. The parts of processor involved in this stage are:
(1) Memory
(2) valE (address)
(3) valA (address for popl and ret; data for rmmovl and pushl)
(4) valP (from the PC increment)
Explanation:
We use the ALU for address calculation, so valE contains our address. It
goes into a logic block that tells the memory where we are reading or
writing data. When we want to write something into memory (rmmovl), we use
register rB and valB for the address calculation, and the contents of valA
are what we want to write, so we take that signal and send it into the data
logic. For popl and ret, the address comes from valA instead. We also
introduce new logic that selects which register receives data read from
memory: for a memory read, rA is the destination register, and we feed its
ID into dstM.
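The address and data selection for the memory stage can be sketched as below. The function name memory_control and its dictionary result are hypothetical, and the instruction-code sets follow the standard Y86 sequential design, which is an assumption beyond what this document spells out.

```python
IRMMOVL, IMRMOVL = 0x4, 0x5
ICALL, IRET, IPUSHL, IPOPL = 0x8, 0x9, 0xA, 0xB


def memory_control(icode: int, val_a: int, val_e: int, val_p: int) -> dict:
    """Return the read/write control signals plus the address and data used."""
    read = icode in (IMRMOVL, IPOPL, IRET)
    write = icode in (IRMMOVL, IPUSHL, ICALL)

    address = None
    if icode in (IRMMOVL, IPUSHL, ICALL, IMRMOVL):
        address = val_e           # address computed by the ALU
    elif icode in (IPOPL, IRET):
        address = val_a           # old stack pointer read from the register file

    data = None
    if icode in (IRMMOVL, IPUSHL):
        data = val_a              # register value being stored
    elif icode == ICALL:
        data = val_p              # return address

    return {"read": read, "write": write, "addr": address, "data": data}


print(memory_control(ICALL, 0, 0xFC, 0x20))
```

For call, the value written is valP, which is why valP appears in the list of inputs to this stage.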
Diagram:
Write back
In this phase we write values into the register files. The parts of the processor involved in
this phase are:
(1) Register file
(2) ValM (from memory)
(3) ValE (from the ALU)
(4) Cond (Conditional move)
Explanation:
First we introduce the E port, whose value comes from valE. The register
written is normally rB, but depending on the op code we may or may not write
it. In the normal case, when we want to write into register rB, dstE
identifies the register to write, and the op code decides whether the write
happens. Next we take the condition signal (Cond) and wire it into the dstE
logic, so that for a conditional move dstE selects rB only when the
condition holds.
Diagram:
Push and Pop instruction:
For push and pop instructions, the instruction fields might name rB, but the
register that must be updated is the stack pointer. For push, pop, ret, and
call, the dstE logic is responsible for selecting %esp as the destination.
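A possible form of the destination selection logic is sketched here in Python. The names dst_e and dst_m are hypothetical, and the instruction-code sets are assumed from the standard Y86 sequential design. Conditional moves share instruction code 2 with rrmovl, so the condition signal gates the write for that code.

```python
IRRMOVL, IIRMOVL, IMRMOVL, IOPL = 0x2, 0x3, 0x5, 0x6
ICALL, IRET, IPUSHL, IPOPL = 0x8, 0x9, 0xA, 0xB
RESP, RNONE = 4, 0xF


def dst_e(icode: int, rb: int, cnd: bool) -> int:
    """Register written through port E (ALU result), or RNONE for no write."""
    if icode == IRRMOVL:
        return rb if cnd else RNONE      # conditional move: write only if Cnd
    if icode in (IIRMOVL, IOPL):
        return rb
    if icode in (IPUSHL, IPOPL, ICALL, IRET):
        return RESP                      # updated stack pointer
    return RNONE


def dst_m(icode: int, ra: int) -> int:
    """Register written through port M (value read from memory)."""
    if icode in (IMRMOVL, IPOPL):
        return ra
    return RNONE


print(dst_e(IRRMOVL, 3, False), dst_e(IPUSHL, RNONE, True), dst_m(IPOPL, 0))
```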
Update the PC:
The last stage updates the program counter register with the address of the
next instruction to be executed. The inputs to this stage are valM, valP,
and valC. To decide among them, we also need the condition signal, which
determines where a conditional jump goes. In short, valP, valM, valC, and
the condition codes decide the address of the next instruction: valC for a
call or a taken jump, valM for ret, and valP otherwise.
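The selection of the next PC value can be written as a small Python function. The name new_pc is hypothetical, and the instruction codes for jXX, call, and ret are the ones shown in the encodings earlier in this document.

```python
IJXX, ICALL, IRET = 0x7, 0x8, 0x9


def new_pc(icode: int, cnd: bool, val_c: int, val_m: int, val_p: int) -> int:
    """Pick the address of the next instruction in the sequential design."""
    if icode == ICALL:
        return val_c              # call target
    if icode == IJXX and cnd:
        return val_c              # taken jump
    if icode == IRET:
        return val_m              # return address read from the stack
    return val_p                  # default: next sequential instruction


print(hex(new_pc(IJXX, False, 0x40, 0, 0x15)))   # 0x15: jump not taken
```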
Diagram:
Y86 implementation Observation:
We only read the instruction in the fetch stage.
We only read from the register file in the decode stage.
We only use the ALU in the execute stage.
We only read from or write to memory in the memory stage.
We only write to the register file in the write-back stage.
For any given instruction, we have to wait for the signals to propagate through the entire
circuit, but at any given instant, most of the hardware is unused.
A key feature of pipelining is that it increases the throughput of the system, that is, the number of
customers served per unit time, but it may also slightly increase the latency, that is, the time
required to service an individual customer. For example, a customer in a cafeteria who only
wants a salad could pass through a non pipelined system very quickly, stopping only at the salad
stage. A customer in a pipelined system who attempts to go directly to the salad stage risks
incurring the wrath of other customers.
In our sequential implementation, we execute a single instruction through all the stages before
moving on to the next instruction. In the pipelined implementation, while the first instruction
is reading the register file, the instruction behind it is being fetched from memory. In the
next cycle, the first instruction has already read the register file and moves into the execute
stage. Meanwhile, the instruction right behind it can decode, and the one behind that can be
fetched. In theory, then, we can have five different instructions executing on the hardware at
the same time.
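The overlap of instructions can be visualized with a tiny Python sketch. It models an idealized five-stage pipeline with no stalls or bubbles, which is an assumption, and the stage letters and the function name pipeline_schedule are illustrative only.

```python
STAGES = ["F", "D", "E", "M", "W"]   # fetch, decode, execute, memory, write back


def pipeline_schedule(n_instructions: int) -> list:
    """For each cycle, map every busy stage to the index of its instruction."""
    total_cycles = n_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        busy = {}
        for depth, stage in enumerate(STAGES):
            instruction = cycle - depth      # instruction i enters F in cycle i
            if 0 <= instruction < n_instructions:
                busy[stage] = instruction
        schedule.append(busy)
    return schedule


for cycle, busy in enumerate(pipeline_schedule(5)):
    print(cycle, busy)
```

With five instructions, the fifth cycle (index 4) has every stage busy with a different instruction, and the whole sequence finishes in 9 cycles instead of the 25 stage-steps a strictly sequential walk would take.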
Pipelining implementation:
The pipeline registers are labeled as follows:
D sits between the fetch and decode stages. It holds information about the most recently
fetched instruction for processing by the decode stage.
E sits between the decode and execute stages. It holds information about the most recently
decoded instruction and the values read from the register file for processing by the execute stage.
M sits between the execute and memory stages. It holds the results of the most recently
executed instruction for processing by the memory stage. It also holds information about branch
conditions and branch targets for processing conditional jumps.
W sits between the memory stage and the feedback paths that supply the computed results
to the register file for writing and the return address to the PC selection logic when completing a
ret instruction.
Fetch stage:
In the fetch stage we have an address from which we want to fetch. We read from memory at that
address, get an instruction, and send it to logic that collects the instruction fields and
latches them into the decode register. Latching means that the register captures the signals
flowing into it and holds them, even if the signals at its inputs change afterward. The
register holds those values until the next clock cycle, when we ask it to load new values.
We call the logic that produces the fetch address calcPC.
Decode stage:
The decode stages of SEQ+ and PIPE– both generate signals dstE and dstM indicating the
destination register for values valE and valM. In SEQ+, we could connect these signals directly
to the address inputs of the register file write ports. With PIPE–, these signals are carried along
in the pipeline through the execute and memory stages, and are directed to the register file only
once they reach the writeback stage (shown in the more detailed views of the stages). We do this
to make sure the write port address and data inputs hold values from the same instruction.
Otherwise, the write back would be writing the values for the instruction in the write-back stage,
but with register IDs from the instruction in the decode stage. As a general principle, we want to
keep all of the information about a particular instruction contained within a single pipeline stage.
One block of PIPE– that is not present in SEQ+ in the exact same form is the block labeled
“Select A” in the decode stage. We can see that this block generates the value valA for the
pipeline register E by choosing either valP from pipeline register D or the value read from the A
port of the register file. This block is included to reduce the amount of state that must be carried
forward to pipeline registers E and M. Of all the different instructions, only the call requires valP
in the memory stage. Only the jump instructions require the value of valP in the execute stage (in
the event the jump is not taken). None of these instructions requires a value read from the
register file. Therefore, we can reduce the amount of pipeline register state by merging these two
signals and carrying them through the pipeline as a single signal valA. This eliminates the need
for the block labeled “Data” in SEQ (Figure 4.23) and SEQ+ (Figure 4.40), which served a
similar purpose. In hardware design, it is common to carefully identify how signals get used and
then reduce the amount of register state and wiring by merging signals such as these.
Next PC prediction:
Our goal in the pipelined design is to issue a new instruction on every clock cycle, meaning that
on each clock cycle, a new instruction proceeds into the execute stage and will ultimately be
completed. Achieving this goal would yield a throughput of one instruction per cycle. To do this,
we must determine the location of the next instruction right after fetching the current instruction.
Unfortunately, if the fetched instruction is a conditional branch, we will not know whether or not
the branch should be taken until several cycles later, after the instruction has passed through the
execute stage. Similarly, if the fetched instruction is a ret, we cannot determine the return
location until the instruction has passed through the memory stage. With the exception of
conditional jump instructions and ret, we can determine the address of the next instruction based
on information computed during the fetch stage. For call and jmp (unconditional jump), it will be
valC, the constant word in the instruction, while for all others it will be valP, the address of the
next instruction. We can therefore achieve our goal of issuing a new instruction every clock
cycle in most cases by predicting the next value of the PC. For most instruction types, our
prediction will be completely reliable. For conditional jumps, we can predict either that a jump
will be taken, so that the new PC value would be valC, or we can predict that it will not be taken,
so that the new PC value would be valP. In either case, we must somehow deal with the case
where our prediction was incorrect and therefore we have fetched and partially executed the
wrong instructions.
This technique of guessing the branch direction and then initiating the fetching of instructions
according to our guess is known as branch prediction. It is used in some form by virtually all
processors.
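The prediction rule described above can be sketched as a Python function. The name predict_pc and the predict_taken parameter are hypothetical. The text allows either strategy for conditional jumps, so the sketch exposes the choice as a flag and defaults to predicting that jumps are taken.

```python
IJXX, ICALL = 0x7, 0x8


def predict_pc(icode: int, ifun: int, val_c: int, val_p: int,
               predict_taken: bool = True) -> int:
    """Guess the next PC during fetch, before the branch outcome is known."""
    if icode == ICALL:
        return val_c                       # call always goes to its target
    if icode == IJXX:
        if ifun == 0 or predict_taken:     # ifun 0 is the unconditional jmp
            return val_c
        return val_p                       # predicted not taken
    # All other instructions fall through. For ret this is only a
    # placeholder: the real address is not known until the memory stage.
    return val_p


print(hex(predict_pc(IJXX, 1, 0x50, 0x15)))                        # 0x50
print(hex(predict_pc(IJXX, 1, 0x50, 0x15, predict_taken=False)))   # 0x15
```

Whichever strategy is chosen, the pipeline still needs a way to cancel the wrongly fetched instructions when the guess turns out to be incorrect.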
Execution stage:
We add pipeline registers between every pair of stages in the pipeline. A stage is the logic
between two pipeline registers, and each stage has a pipeline register that holds the
signals it computes on. What happens in a stage therefore depends on the values held in the
pipeline register for that stage.
To control the ALU we need to know which function the ALU is supposed to perform, and we
figure this out using a piece of logic. We need icode in the execute stage. We also need
ifun, because it tells the ALU which function to execute. rA and rB are register numbers
that are used in the decode stage to read the register file. After that stage we no longer
need them, but we do need the contents of those registers in the execute stage: valA, which
came out of port A of the register file, and valB, which came out of port B. We need valC as
well. valP is the value of the next program counter and is used to figure out the PC, so we
do not need it in the execute stage. From the instruction's point of view, rA and rB
sometimes have two meanings: for example, rB can be read as a source operand and can also
identify where a result is written.
Diagram:
Pipeline hazards:
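Introducing pipelining into a system with feedback can lead to problems when there are
dependencies between successive instructions. We must resolve this issue before we can
complete our design. These dependencies can take two forms: (1) data dependencies, where the
results computed by one instruction are used as the data for a following instruction, and
(2) control dependencies, where one instruction determines the location of the following
instruction, such as when executing a jump, call, or return.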
Diagram:
Pipelined execution of prog2 using stalls. After decoding the addl instruction in cycle 6, the stall
control logic detects a data hazard due to the pending write to register %eax in the write-back
stage. It injects a bubble into the execute stage and repeats the decoding of the addl instruction
in cycle 7. In effect, the machine has dynamically inserted a nop instruction.
Detailed Mechanism:
The pipeline control logic must handle the following four control cases, for which other
mechanisms, such as data forwarding and branch prediction, do not suffice:
Processing ret:
The pipeline must stall until the ret instruction reaches the write-back stage.
Load/use hazards:
The pipeline must stall for one cycle between an instruction that reads a value from memory and
an instruction that uses this value.
Mispredicted branches:
By the time the branch logic detects that a jump should not have been taken, several instructions
at the branch target will have started down the pipeline. These instructions must be removed
from the pipeline.
Exceptions:
When an instruction causes an exception, we want to disable the updating of the programmer-
visible state by later instructions and halt execution once the excepting instruction reaches the
write-back stage.
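Three of the four cases can be detected with simple comparisons on the pipeline register contents. The Python sketch below is one way to express this and is not the actual control logic of any particular implementation. It assumes jumps are predicted taken, it leaves out exception handling, and the signal names (D_icode, E_dstM, d_srcA, e_Cnd, and so on) follow the usual Y86 naming style but are assumptions here.

```python
IMRMOVL, IJXX, IRET, IPOPL = 0x5, 0x7, 0x9, 0xB


def pipeline_control(D_icode: int, E_icode: int, M_icode: int,
                     E_dstM: int, d_srcA: int, d_srcB: int,
                     e_Cnd: bool) -> dict:
    """Return the stall and bubble signals for pipeline registers F, D and E."""
    # Load/use hazard: the instruction in execute will load a register
    # that the instruction in decode wants to read.
    load_use = E_icode in (IMRMOVL, IPOPL) and E_dstM in (d_srcA, d_srcB)
    # A ret is somewhere between decode and memory: its target is unknown.
    ret_in_flight = IRET in (D_icode, E_icode, M_icode)
    # A jump predicted taken turns out not to be taken.
    mispredicted = E_icode == IJXX and not e_Cnd

    return {
        "F_stall": load_use or ret_in_flight,
        "D_stall": load_use,
        "D_bubble": mispredicted or (ret_in_flight and not load_use),
        "E_bubble": mispredicted or load_use,
    }


# mrmovl in execute loads %eax (ID 0); the instruction in decode reads %eax.
print(pipeline_control(0x6, IMRMOVL, 0x1, 0, 0, 2, True))
```

A stall keeps a pipeline register unchanged for a cycle, while a bubble replaces its contents with the equivalent of a nop.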
Explanation:
The decode-stage logic detects that register %eax is the source register for operand valB, and that
there is also a pending write to %eax on write port E. It can therefore avoid stalling by simply
using the data word supplied to port E (signal W_valE) as the value for operand valB. This
technique of passing a result value directly from one pipeline stage to an earlier one is commonly
known as data forwarding (or simply forwarding, and sometimes bypassing). It allows the
instructions of the program above to proceed through the pipeline without any stalling. Data
forwarding requires adding additional data connections and control logic to the basic hardware
structure.
Pipelined execution using forwarding. In cycle 6, the decode-stage logic detects the presence of a
pending write to register %eax in the write-back stage. It uses this value for source operand valB
rather than the value read from the register file.
Pipelined execution using forwarding. In cycle 5, the decode stage logic detects a pending write to
register %edx in the write-back stage and to register %eax in the memory stage. It uses these as the
values for valA and valB rather than the values read from the register file.
Data forwarding can also be used when there is a pending write to a register in the memory stage,
avoiding the need to stall for the program shown. In cycle 5, the decode-stage logic detects a pending
write to register %edx on port E in the write-back stage, as well as a pending write to register
%eax that is on its way to port E but is still in the memory stage. Rather than stalling until the
writes have occurred, it can use the value in the write-back stage (signal W_valE) for operand
valA and the value in the memory stage (signal M_valE) for operand valB.
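The selection between a forwarded value and the register file value can be sketched in Python. The function name forward, the list-of-pairs representation, and the numeric values are hypothetical. The sources are ordered from the most recent instruction to the oldest, so the newest pending write to a register wins.

```python
RNONE = 0xF


def forward(src: int, reg_value: int, sources: list) -> int:
    """Return the value for source register src, preferring pending writes.

    sources is a list of (destination register ID, value) pairs ordered from
    the youngest instruction in the pipeline to the oldest.
    """
    if src == RNONE:
        return reg_value                  # no register is being read
    for dst, value in sources:
        if dst == src:
            return value                  # use the forwarded value
    return reg_value                      # no pending write: use the register file


# Situation like cycle 5 above, with made-up values: %eax (ID 0) is pending in
# the memory stage (M_valE = 3), %edx (ID 2) in the write-back stage (W_valE = 10).
pending = [(0, 3), (2, 10)]
val_a = forward(2, reg_value=0, sources=pending)   # valA reads %edx
val_b = forward(0, reg_value=0, sources=pending)   # valB reads %eax
print(val_a, val_b)
```

Here valA takes the write-back value and valB takes the memory-stage value, so neither operand waits for the register file to be updated.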