Computer Architecture
UNIT II ARITHMETIC 9
Fixed point Addition, Subtraction, Multiplication and Division. Floating Point arithmetic, High performance
arithmetic, Sub word parallelism
Parallel processing architectures and challenges, Hardware multithreading, Multicore and shared memory
multiprocessors, Introduction to Graphics Processing Units, Clusters and Warehouse scale computers -
Introduction to Multiprocessor network topologies.
TOTAL: 45 PERIODS
TEXT BOOKS:
1. David A. Patterson and John L. Hennessy, "Computer Organization and Design", Fifth edition, Morgan
Kaufmann / Elsevier, 2014. (UNIT I-V)
2. Miles J. Murdocca and Vincent P. Heuring, "Computer Architecture and Organization: An
Integrated Approach", Second edition, Wiley India Pvt Ltd, 2015. (UNIT IV, V)
REFERENCES
1. V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, "Computer Organization", Fifth edition, McGraw-Hill Education India Pvt Ltd, 2014.
2. William Stallings, "Computer Organization and Architecture", Seventh edition, Pearson Education, 2006.
UNIT – I
COMPUTER ORGANIZATION & INSTRUCTIONS
1.1 INTRODUCTION
Computer architecture acts as the interface between the hardware and the lowest level software.
Computer architecture refers to:
Attributes of a system visible to programmers like data type of variables.
Attributes that have a direct impact on the execution of programs like clock cycle.
Computer Architecture is defined as the study of the structure, behavior,
and design of computers.
Computer Organization: It refers to the operational units and their interconnections that
realize the architectural specifications. It describes the function and design of the various
units of a digital computer that store and process information. The attributes in computer
organization refer to:
Control signals
Computer/peripheral interface
Memory technology
Computer hardware: Consists of electronic circuits, displays, magnetic and optical storage
media, electromechanical equipment and communication facilities.
Computer Architecture: It is concerned with the structure and behavior of the computer.
It includes the information formats, the instruction set and techniques for addressing memory.
The attributes in computer architecture refer to the:
Instruction set
Data representation
I/O mechanisms
Addressing techniques
The basic distinction between architecture and organization is that the attributes of the
former are visible to programmers, whereas the attributes of the latter describe how features are
implemented in the system.
The modern day computer system's functional units are given by the Von Neumann
Architecture.
1. Primary Memory:
It is a fast memory that operates at electronic speeds. Programs must be stored in
the memory while they are being executed. The memory contains a large number of semiconductor
storage cells, each of which stores 1 bit of information. The cells are processed in groups of
fixed size called words. To provide easy access to any word in the memory, a distinct address
is associated with each word location. Addresses are numbers that identify successive
locations. The number of bits in each word is called the word length. The word length
ranges from 16 to 64 bits. There are 3 types of primary memory:
I. RAM: Memory in which any location can be reached in a short and fixed amount of
time after specifying its address is called RAM. The time required to
access one word is called the Memory Access Time.
II. Cache Memory: The small, fast, RAM units are called Cache. They are
tightly coupled with processor to achieve high performance.
III. Main Memory: The largest and the slowest unit is the main memory.
Arithmetic & Logic Unit
Most computer operations are executed in the ALU. The arithmetic-logic section performs
arithmetic operations, such as addition, subtraction, multiplication, and division. Through
internal logic capability, it tests various conditions encountered during processing and takes
action based on the result. Data may be transferred back and forth between these two sections
several times before processing is completed. Access time to registers is faster than access time
to the fastest cache unit in memory.
Output Unit
Its function is to send the processed results to the outside world.
Control Unit
The operations of the input unit, output unit and ALU are coordinated by the control unit. The
control unit is the nerve centre that sends control signals to other units and senses their states.
The control section directs the flow of traffic (operations) and data. It also maintains order
within the computer. The control section selects one program statement at a time from the
program storage area, interprets the statement, and sends the appropriate electronic impulses
to the arithmetic-logic and storage sections so they can carry out the instructions. The control
section does not perform actual processing operations on the data.
The control section instructs the input device on when to start and stop transferring data
to the input storage area. It also tells the output device when to start and stop receiving data
from the output storage area. Data transfers between the processor and the memory are
controlled by the control unit through timing signals. Information stored in the memory is
fetched, under program control into an arithmetic and logic unit, where it is processed.
1.2.1 Evolution of Computers
The word 'computer' is an old word that has changed its meaning several times in the
last few centuries.
Today, the word computer refers to computing devices, whether or not they are
electronic, programmable, or capable of 'storing and retrieving' data.
The Mechanical Era (1623-1945)
Wilhelm Schickard, Blaise Pascal, and Gottfried Leibniz were among mathematicians
who designed and implemented calculators that were capable of addition, subtraction,
multiplication, and division during the seventeenth century.
The first multi-purpose or programmable computing device was probably Charles
Babbage’s Difference Engine, which was begun in 1823 but never completed.
In 1842, Babbage designed a more ambitious machine, called the Analytical Engine but
unfortunately it also was only partially completed.
Babbage, together with Ada Lovelace recognized several important programming
techniques, including conditional branches, iterative loops and index variables.
Babbage designed the machine which is the first to be used in computational science.
In 1833, George Scheutz and his son, Edvard began work on a smaller version of the
difference engine and by 1853 they had constructed a machine that could process 15-
digit numbers and calculate fourth-order differences.
The US Census Bureau was one of the first organizations to use the mechanical
computers which used punch-card equipment designed by Herman Hollerith to tabulate
data for the 1890 census.
The first ICs were based on small-scale integration (SSI) circuits, which had around 10
devices per circuit (or 'chip'), and evolved to the use of medium-scale integrated (MSI)
circuits, which had up to 100 devices per chip.
Multilayered printed circuits were developed and core memory was replaced by faster,
solid state memories.
In 1964, Seymour Cray developed the CDC 6600, which was the first architecture to use
functional parallelism.
By using 10 separate functional units that could operate simultaneously and 32
independent memory banks, the CDC 6600 was able to attain a computation rate of one
million floating point operations per second (Mflops).
Five years later CDC released the 7600, also developed by Seymour Cray.
The CDC 7600, with its pipelined functional units, is considered to be the first vector
processor and was capable of executing at ten Mflops.
The IBM 360/91, released during the same period, was roughly twice as fast as the CDC
6600.
Early in this third generation, Cambridge University and the University of London
cooperated in the development of CPL (Combined Programming Language, 1963).
CPL was an attempt to capture only the important features of the complicated and
sophisticated ALGOL.
However, like ALGOL, CPL was large with many features that were hard to learn.
In an attempt at further simplification, Martin Richards of Cambridge developed a
subset of CPL called BCPL (Basic Combined Programming Language, 1967).
In 1970 Ken Thompson of Bell Labs developed yet another simplification of CPL called
simply B, in connection with an early implementation of the UNIX operating system.
Fourth Generation (1972-1984)
Large scale integration (LSI - 1000 devices per chip) and very large scale integration
(VLSI - 100,000 devices per chip) were used in the construction of the fourth generation
computers.
Whole processors could now fit onto a single chip, and for simple systems the entire
computer (processor, main memory, and I/O controllers) could fit on one chip.
However Sequent provided a library of subroutines that would allow programmers to
write programs that would use more than one processor, and the machine was widely
used to explore parallel algorithms and programming techniques.
The Intel iPSC-1, also known as 'the hypercube', connected each processor to its own
memory and used a network interface to connect processors.
This distributed memory architecture meant memory was no longer a problem and
large systems with more processors (as many as 128) could be built.
Also introduced was a data-parallel or SIMD machine, in which several thousand very
simple processors work under the direction of a single control unit.
Both wide area network (WAN) and local area network (LAN) technology developed
rapidly.
Sixth Generation (1990 - )
Most of the developments in computer systems since 1990 have not been fundamental
changes but have been gradual improvements over established systems.
This generation brought about gains in parallel computing in both the hardware and in
improved understanding of how to develop algorithms to exploit parallel architectures.
Workstation technology continued to improve, with processor designs now using a
combination of RISC, pipelining, and parallel processing.
Wide area networks, network bandwidth and speed of operation and networking
capabilities have kept developing tremendously.
Personal computers (PCs) now operate with gigahertz processors, multi-
Gigabyte disks, hundreds of Mbytes of RAM, color printers, high-resolution graphic
monitors, stereo sound cards and graphical user interfaces.
Thousands of software products (operating systems and application software) exist today,
and Microsoft Inc. has been a major contributor. Microsoft is said to be one of the
biggest companies ever, and its chairman, Bill Gates, was rated as the richest man
for several years.
Finally, this generation has brought about microcontroller technology. Microcontrollers
are 'embedded' inside some other devices so that they can control the
features or actions of the product.
They work as small computers inside devices and now serve as essential components in
most machines.
1. Moore's Law
Moore's law states that the number of transistors will double every 18 months.
It is an observation that the number of transistors in a dense integrated circuit doubles
about every two years. It is an observation and projection of a historical trend and not a physical
or natural law.
2. Abstract Design
It is a major productivity technique for hardware and software. Abstractions are used to
represent the design at different levels of representation; an abstraction hides the detailed
lower-level design details from the higher levels.
3. Performance through parallelism
Parallelism executes programs faster by performing several computations at the same
time. This requires hardware with multiple processing units. The overall performance of the
system is significantly increased by performing operations in parallel.
4. Performance through Pipelining
Pipelining increases the CPU instruction throughput. Throughput is a performance
metric which is the number of instructions completed per unit of time. Pipelining does not
reduce the execution time of an individual instruction; in fact, it slightly increases the
execution time of each instruction due to overhead in the pipeline control. Nevertheless, the
increase in instruction throughput means that a program runs faster and has lower total execution time.
5. Make the Common Case Fast
Making the common case fast will tend to enhance performance better than optimizing
the rare case. Ironically, the common case is often simpler than the rare case and hence is often
easier to enhance. In making a design trade-off, favor the frequent case over the infrequent case.
Amdahl's Law can be used to quantify this principle. This also applies when determining how to
spend resources, since the impact of making some occurrence faster is higher if the occurrence
is frequent. Making the common case fast:
Helps performance
Is simpler and can be done faster
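As an illustration of this principle, Amdahl's Law can be sketched in C. This is a minimal sketch, not from the original text; the fraction and improvement factor below are made-up example values.

    #include <stdio.h>

    /* Amdahl's Law: overall speedup when a fraction f of execution
       time is improved by a factor s; the rest is unaffected. */
    double amdahl_speedup(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        /* Hypothetical case: the common case is 80% of run time
           and is made 4 times faster. */
        printf("Speedup = %.2f\n", amdahl_speedup(0.80, 4.0)); /* 2.50 */
        return 0;
    }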
6. Performance via prediction
The computer can perform better (on average) by making rational guesses about the
outcome of decisions. Instead of wasting clock cycles waiting for results to become certain,
the computer can remarkably improve its performance.
7. Hierarchy of memories
Programmers want memory to be fast, large, and cheap. The memory speed is a primary
factor in determining the performance of the system. The memory capacity limits the size of
problems that can be solved.
After being sliced from the silicon ingot, blank wafers are put through 20 to 40 steps to
create patterned wafers. These patterned wafers are then tested with a wafer tester, and a map
of the good parts is made. Then, the wafers are diced into dies. The good dies are then bonded
into packages and tested one more time before shipping the packaged parts to customers.
Cost of an IC is found from:
Cost per die = (cost per wafer) / ((dies per wafer) × yield)
where yield refers to the fraction of dies that pass testing.
Dies per wafer = wafer area / die area
Yield = 1 / (1 + (defects per area × die area) / 2)^2
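A small C sketch of these formulas (not from the text; the wafer cost, areas and defect density below are made-up illustrative values):

    #include <stdio.h>

    /* Cost per die from the formulas above. */
    double cost_per_die(double wafer_cost, double wafer_area,
                        double die_area, double defects_per_area) {
        double dies_per_wafer = wafer_area / die_area;
        double t = 1.0 + defects_per_area * die_area / 2.0;
        double yield = 1.0 / (t * t);
        return wafer_cost / (dies_per_wafer * yield);
    }

    int main(void) {
        /* Hypothetical: $5000 wafer of 70000 mm^2, 100 mm^2 dies,
           0.02 defects per mm^2. */
        printf("Cost per die = $%.2f\n",
               cost_per_die(5000.0, 70000.0, 100.0, 0.02));
        return 0;
    }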
Programmable Logic Device (PLD)
A programmable logic device (PLD) is an electronic component used to build
reconfigurable digital circuits. Unlike a logic gate, which has a fixed function, a PLD has an
undefined function at the time of manufacture. Before the PLD can be used in a circuit it must be
programmed, that is, reconfigured.
The major limitations of PLDs are:
They consume space, due to the large number of switches required for programmability.
They have low speed, due to the presence of many switches.
Custom chips
An Application-Specific Integrated Circuit (ASIC) is an integrated circuit (IC) customized
for a particular use, rather than intended for general-purpose use. Application-Specific Standard
Products (ASSPs) are intermediate between ASICs and industry standard integrated circuits.
1.2.4 Performance
Elapsed time and throughput are two different ways of measuring speed.
Elapsed time (also called wall-clock time or response time) is the total time to complete a
task, including disk accesses, memory accesses, input/output (I/O) activities and
operating system overhead.
CPU execution time is the actual time the CPU spends computing for a specific task. It is
the better measure for processor speed because it is less dependent on other system
components.
The User CPU time is the CPU time spent in the program itself. System CPU time is the
CPU time spent in the operating system performing tasks on behalf of the program.
The CPU Performance equation (CPU Time) is the product of the number of instructions
executed, the average CPI of the program and the CPU clock cycle time.
Cycles per Instruction (CPI) is the average number of clock cycles taken by an instruction to
complete its execution.
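A minimal C sketch of this equation (illustrative; the figures used are those of Example 1.2 below):

    #include <stdio.h>

    /* CPU time = instruction count x average CPI x clock cycle time,
       where clock cycle time = 1 / clock rate. */
    double cpu_time(double insts, double cpi, double clock_rate_hz) {
        return insts * cpi / clock_rate_hz;
    }

    int main(void) {
        printf("CPU time = %.3f s\n", cpu_time(1e7, 2.5, 200e6)); /* 0.125 s */
        return 0;
    }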
The use of benchmarks whose performance depends on very small code segments
encourages optimizations in either the architecture or compiler that target these
segments.
The arithmetic mean is proportional to execution time, assuming that the programs in
the workload are each run an equal number of times.
Weighted arithmetic mean is an average of the execution time of a workload with
weighting factors designed to reflect the presence of the programs in a workload;
it is computed as the sum of the products of the weights and the execution times.
Example 1.1: For a given program, the execution time on machine A is 1s and on B is 10s.
Find the performance or speed up of the machines.
Execution A= 1s
Execution B=10s
Speedup = Performance of A / Performance of B = Execution time of B / Execution time of A
Speedup = 10 / 1 = 10
Machine A is 10 times faster than machine B.
Example 1.2: For a certain program with 1,00,00,000 instructions, find the execution time given
the average CPI is 2.5 cycles/instruction and clock rate as 200MHz.
Number of instructions=1,00,00,000
Average CPI=2.5 cycles/ instruction
Clock rate = 200 MHz = 200,000,000 Hz
Clock cycle = 1 / Clock rate = 1 / 200,000,000 = 5 × 10^-9 s
Execution time = Number of instructions × CPI × Clock cycle
= 10,000,000 × 2.5 × 5 × 10^-9 = 0.125 s
Example 1.3: A certain program with 1,00,00,000 instructions has an average CPI of 2.5
cycles/instruction and a clock rate of 200MHz. When a new optimizing compiler is deployed, the
instruction count is reduced to 95,00,000 with a new CPI of 3.0 cycles/instruction at a modified
clock rate of 300MHz. Find the speedup.
Speedup = Old Execution Time / New Execution Time
= (I old × CPI old × Clock cycle old) / (I new × CPI new × Clock cycle new)
= (10000000 × 2.5 × 5 × 10^-9) / (9500000 × 3 × 3.33 × 10^-9)
= 1.315
The new compiler is 1.315 times faster than the old one.
Example 1.4: A program runs in 10 seconds on computer A, which has a 2 GHz clock. We are
trying to help a computer designer build a computer, B, which will run this program in 6
seconds. The designer has determined that a substantial increase in the clock rate is possible,
but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as
many clock cycles as computer A for this program. What clock rate should we tell the designer to
target?
Clock cycles A = CPU time A × Clock rate A = 10 × 2 × 10^9 = 20 × 10^9 cycles
Clock cycles B = 1.2 × 20 × 10^9 cycles
Clock rate of B = Clock cycles B / CPU time B = (1.2 × 20 × 10^9) / 6
= 4 GHz
Example 1.5: Suppose we have two implementations of the same instruction set architecture.
Computer A has a clock cycle time of 250ps and a CPI of 2.0 for some program, and computer B
has a clock cycle time of 500ps and a CPI of 1.2 for the same program. Which computer is faster
for this program and by how much?
Computer A: Cycle Time = 250 ps, CPI = 2.0
Computer B: Cycle Time = 500 ps, CPI = 1.2
For an instruction count I:
CPU time A = I × 2.0 × 250 ps = 500 × I ps
CPU time B = I × 1.2 × 500 ps = 600 × I ps
Computer A is faster for this program, by a factor of 600 / 500 = 1.2.
An increase in the power density increases the chip temperature, which slows down the
transistor switching rate and hence, the overall speed of the computer.
Cooling solutions are very expensive, and hence, computer architects have focused on
innovating device, circuit and architecture level techniques to combat power wall.
Dynamic voltage and frequency scaling are solutions for these problems. Here the
operating voltage and frequency of the chip are dynamically controlled based on the chip
activity.
In CMOS (complementary metal oxide semiconductor) IC technology, the dominant source of
power dissipation is dynamic power: Power = Capacitive load × Voltage^2 × Frequency switched.
Example 1.6: Suppose we developed a new, simpler processor that has 85% of the capacitive
load of the more complex older processor. Further, assume that it has adjustable voltage, so that
it can reduce voltage 15% compared to the older processor, which results in a 15% shrink in
frequency. What is the impact on dynamic power? Given: 85% of the capacitive load of the old
CPU, 15% voltage reduction, 15% frequency reduction.
P new / P old = ((0.85 × C) × (0.85 × V)^2 × (0.85 × F)) / (C × V^2 × F) = 0.85^4 ≈ 0.52
The new processor uses 0.52 the power of the old processor.
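The same scaling, as a minimal C sketch (illustrative only):

    #include <stdio.h>

    int main(void) {
        /* Dynamic power: P = capacitive load x voltage^2 x frequency.
           All three factors scale to 0.85 of the old processor. */
        double c = 0.85, v = 0.85, f = 0.85;
        printf("P_new / P_old = %.2f\n", c * v * v * f);  /* 0.52 */
        return 0;
    }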
Machines based on an SIMD model are well suited to scientific computing since they
involve lots of vector and matrix operations. The organized data elements of vectors can be
divided into multiple sets so that the information can be passed to all the Processing
Elements (PEs), and each PE can process one data set.
Multiple Instruction, Multiple Data (MIMD):
This is capable of executing multiple instructions on multiple data sets.
Each PE in the MIMD model has separate instruction and data streams; therefore
machines built using this model are suited to any kind of application.
Unlike SIMD and MISD machines, PEs in MIMD machines work asynchronously.
Immediate Addressing:
This is the simplest form of addressing. Here, the operand is given in the instruction.
This mode is used to define constant or set initial values of variables.
The advantage of this mode is that no memory reference other than instruction fetch is
required to obtain operand.
The disadvantage is that the size of the constant is limited to the size of the address field,
which in most instruction sets is small compared to the word length.
Example: ADD 3
Direct Addressing:
In direct addressing mode, effective address of the operand is given in the address field
of the instruction.
It requires one memory reference to read the operand from the given location and
provides only a limited address space.
Length of the address field is usually less than the word length.
Example: Move P, R0
Add Q, R0
where P and Q are the addresses of the operands and R0 is any register. Sometimes the
Accumulator (AC) is the default register. Then the instruction will look like:
Add A
Register Addressing:
Register addressing mode is similar to direct addressing. The only difference is that the
address field of the instruction refers to a register rather than a memory location.
3 or 4 bits are used as the address field in the instruction to refer to 8 to 16 general purpose
registers (GPRs).
The operands are in registers that reside within the CPU.
The instruction specifies a register in CPU, which contain the operand.
There is no need to compute the actual address as the operand is in a register and to get
operand there is no memory access involved.
The advantages of register addressing are that only a small address field is needed in the
instruction and that instruction fetch is faster.
The disadvantages include a very limited address space; the use of multiple registers
can improve performance, but it complicates the instructions.
Example: MOV AX, BX
Displacement Addressing:
It is a combination of direct addressing and register indirect addressing.
Displacement addressing requires that the instruction have two address fields, at
least one of which is explicit: one address field indicates a direct address and the
other refers indirectly through a register.
The value contained in one address field is A, which is used directly, and the value in the
other address field is R, which refers to a register whose contents are added to A to
produce the effective address.
Example: EA=A+(R)
Indexed addressing:
The content of the Index Register is added to the direct address part of the instruction to
obtain the effective address. The register indirect addressing field of the instruction points to
the Index Register, which is a special CPU register that contains an index value; the direct
addressing field contains the base address.
The data array is in memory and each operand in the array is stored in memory
relative to base address. The distance between the beginning address and the address of
operand is the indexed value stored in indexed register.
Any operand in the array can be accessed with the same instruction, provided that
the index register contains the correct index value; the index register
can be incremented to facilitate access to consecutive operands.
Stack Addressing:
Stack is a linear array of locations referred to as a last-in first-out (LIFO) list.
The stack is a reserved block of locations; items are appended or deleted only at the top of the stack.
Stack pointer is a register which stores the address of top of stack location.
This mode of addressing is also known as implicit addressing.
Example: Add
This instruction pops two items from the stack and adds.
Additional Modes:
Auto-increment mode:
Auto-increment addressing mode is similar to register indirect addressing mode
except that the register is incremented after its value is loaded (or accessed) into another
location such as the accumulator (AC).
The Effective Address of the operand is the contents of a register in the instruction.
After accessing the operand, the contents of this register is automatically incremented to
point to the next item in the list.
Example: (R)+
The contents of register R will be accessed and then R will be incremented to point to the
next item in the list.
Auto-decrement mode:
Example: -(R)
The contents of register R will be decremented first and then accessed.
1.4 INSTRUCTIONS
An instruction is a binary code, which specifies a basic operation for the computer.
Operation Code (op code) defines the operation type. Operands define the operation
source and destination.
Instruction Set Architecture (ISA) describes the processor in terms of what the
assembly language programmer sees, i.e. the instructions and registers.
The op codes and operands follow the Stored Program Concept.
Stored Program Concept is an idea that instructions and data of many types can be stored in
memory as numbers, leading to the stored program computer.
1.4.1 Operations
The computer performs arithmetic through operations.
A MIPS arithmetic instruction performs only one operation and must always have
exactly three operands.
Example: Add a, b, c
Adds b and c and stores the sum in a.
The MIPS address is specified in part by a constant and in part by the contents of
a register.
Many programs have more variables than computers have registers. The
compiler tries to keep the most frequently used variables in registers and places
the rest in memory, using loads and stores to move variables between registers
and memory.
The process of putting less commonly used variables into memory is called
spilling registers.
Constant or Immediate Operands
2^30 memory words: Memory[0], Memory[4], … The contents can be accessed only with data
transfer instructions. MIPS uses byte addressing, so sequential word addresses differ by 4.
A bit or binary digit is a single digit of a binary number and is the smallest indivisible unit of
computing.
The binary digit may be used to denote high or low, on or off, true or false, or 1 or 0.
Registers are part of every instruction, hence there must be a convention to map register
names into numbers.
Example: add $t0,$s1,$s2
This instruction is mapped to its equivalent decimal field representation as 0, 17, 18, 8, 0, 32
(op, rs, rt, rd, shamt, funct), since $s1 is register 17, $s2 is register 18 and $t0 is register 8. The
op code is the field that denotes the operation and format of an instruction.
The first class of such operations is called shifts. They move all the bits in a word to the
left or right, filling the emptied bits with 0s.
0000 0000 0000 0000 0000 0000 0000 1001_2 = 9_10
After left shifting by four, the new value is 144:
0000 0000 0000 0000 0000 0000 1001 0000_2 = 144_10
Left shift: Left shifting by i bits is equivalent to multiplying the number by 2^i.
Right Shift: Right shifting by i bits is equivalent to dividing the number by 2^i.
AND: This is used in masking of bits.
OR: It is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1
NOT: A logical bit-by-bit operation with one operand that inverts the bits; that is, it
replaces every 1 with a 0, and every 0 with a 1.
NOR: A logical bit-by-bit operation with two operands that calculates the NOT of the OR
of the two operands.
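The following minimal C sketch (not from the original text; the operand values are arbitrary) demonstrates these shift and logical operations:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t x = 9;                 /* 0...0 1001 in binary */
        printf("%u\n", x << 4);         /* 144: left shift by i multiplies by 2^i */
        printf("%u\n", x >> 1);         /* 4: right shift by i divides by 2^i */
        printf("%u\n", x & 0x0000000F); /* 9: AND masks off unwanted bits */
        printf("%u\n", x | 0x30);       /* 57: OR places a 1 where either operand has a 1 */
        printf("%u\n", ~x);             /* NOT inverts every bit */
        printf("%u\n", ~(x | 0x30));    /* NOR: the NOT of the OR */
        return 0;
    }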
Example:
Consider the following statement,
if (i == j) f = g + h; else f = g – h;
Assume f, g, h, i and j correspond to registers $s0 through $s4 (the register assignment used
in the sub instruction below). The compiled MIPS code first branches to the else part when the
condition fails, then executes the then part:
bne $s3, $s4, Else # go to Else if i ≠ j
add $s0, $s1, $s2 # f = g + h (skipped if i ≠ j)
At the end of the then part, an unconditional branch skips over the else part.
This instruction says that the processor always follows the branch. To distinguish
between conditional and unconditional branches, the MIPS name for this type of instruction is
jump, abbreviated as j (the label Exit is defined below).
j Exit # go to Exit
The assignment statement in the else portion of if statement can again be compiled
into a single instruction. We just need to append the label Else to this instruction. We also show
the label Exit that is after this instruction, showing the end of the if-then-else compiled code:
Else: sub $s0, $s1, $s2 # f = g – h (skipped if i = j)
Exit:
Compilers create branches and labels wherever necessary for maintaining flow of the
program. Also, the assembler calculates the addresses and relieves the compiler and the
assembly language programmer.
Looping:
When a set of statements has to be executed more number of times, looping statements
are used.
Example:
while (save[i] == k)
i += 1;
i and k correspond to registers $s3 and $s5 and the base of the array save is in $s6. The MIPS
instructions are:
The first step is to load save[i] into a temporary register. This operation needs an
address. Multiply the index i by 4 and add the result to the base of the array to obtain the address.
Add the label Loop to it to branch back to that instruction at the end of the loop: Loop:
sll $t1,$s3,2 # Temp reg $t1 = 4 * i
To get the address of save[i] , add $t1 and the base of save in $s6:
add $t1,$t1,$s6 # $t1 = address of save[i]
Use that address to load save[i] into a temporary register:
lw $t0,0($t1) # Temp reg $t0 = save[i]
bne $t0,$s5,Exit # go to Exit if save[i] ≠ k
addi $s3,$s3,1 # i = i + 1
j Loop # go to Loop
Exit:
Unconditional branch: j L # Goto L (jump to target address L)
UNIT - II
ARITHMETIC
2.1 INTRODUCTION
Data is manipulated by using the arithmetic instructions in digital computers to give
solution for the computation problems. The addition, subtraction, multiplication and division
are the four basic arithmetic operations. The arithmetic processing unit is responsible for
executing these operations and it is located in the central processing unit.
The arithmetic instructions are performed on binary or decimal data. Fixed-point
numbers are used to represent integers or fractions. These numbers can be signed or
unsigned. A wide range of arithmetic operations can be derived from the
basic operations.
Examples (4-bit sign-magnitude representation):
+3 0011
-3 1011
0 0000
-0 1000
5 0101
-5 1101
The one's complement of a negative binary number is the complement of its positive
counterpart; to take the one's complement of a binary number, complement (invert) every bit.
Number One’s complement Representation
00001000 (+8) 11110111
10001000(-8) 01110111
00001100(+12) 11110011
10001100(-12) 01110011
In two's complement form, a negative number is the 2's complement of its positive number, with
the subtraction of two numbers being A – B = A + (2's complement of B), using much the same
process as before; basically, two's complement is adding 1 to the one's
complement of the number.
The main difference between 1's complement and 2's complement is that 1's
complement has two representations of 0, (+0): 00000000 and (-0): 11111111. In 2's
complement, there is only one representation for zero: 00000000 (0).
+0: 00000000
2's complement of -0:
-0: 00000000 (Signed magnitude representation)
-2 = 1's complement of 2 + 1:
1111 1111 1111 1101 (1's complement of 2) + 1
= 1111 1111 1111 1110 (16 bits)
= 1111 1111 1111 1111 1111 1111 1111 1110 (32 bits)
To convert a 16-bit number to a 32-bit number, copy the bit in the MSB of the 16-bit number 16
times to fill the left half; this is called sign extension.
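As a minimal C sketch of sign extension (C performs this automatically when a narrower signed integer is widened):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        int16_t narrow = -2;          /* 1111 1111 1111 1110 in 16 bits */
        int32_t wide = narrow;        /* the MSB (sign bit) is copied into
                                         the upper 16 bit positions */
        printf("%08X\n", (uint32_t)wide);   /* FFFFFFFE */
        return 0;
    }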
A fixed-point number representation is a real data type for a number that has a fixed
number of digits after the radix point or decimal point.
A common method of integer representation is sign and magnitude
representation: one bit is used for denoting the sign and the remaining bits denote the
magnitude. With 7 bits reserved for the magnitude, the largest and smallest numbers
represented are +127 and –127. Fixed-point numbers are useful for representing fractional
values, usually in base 2 or base 10, when the executing processor has no floating point unit
(FPU) or if fixed-point provides improved performance or accuracy for the application at
hand. Most low-cost embedded microprocessors and microcontrollers do not have an FPU.
A value of a fixed-point data type is essentially an integer that is scaled by a specific
factor. The scaling factor is usually a power of 10 (for human convenience) or a power of 2
(for computational efficiency). However, other scaling factors may be used occasionally, e.g. a
time value in hours may be represented as a fixed-point type with a scale factor of 1/3600 to
obtain values with one-second accuracy. The maximum value of a fixed-point type is the
largest value that can be represented in the underlying integer type, multiplied by the scaling
factor; and similarly for the minimum value.
Example:
The value 1.23 can be represented as 1230 in a fixed-point data type with scaling
factor of 1/1000.
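A minimal C sketch of this representation, using the scaling factor of 1/1000 from the example above (the rescaling after multiplication shown is one common convention and illustrates the truncation discussed next):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Fixed-point value = underlying integer / 1000. */
        int32_t a = 1230;                       /* represents 1.23 */
        printf("a = %.3f\n", a / 1000.0);

        /* The raw product is scaled by (1/1000)^2, so it must be
           divided by 1000 once; the integer division truncates. */
        int32_t p = (int32_t)(((int64_t)a * a) / 1000);
        printf("a * a = %.3f\n", p / 1000.0);   /* 1.512, not 1.5129 */
        return 0;
    }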
Precision loss and overflow
Fixed point operations can produce results that have more bits than the operands,
so there is a possibility of information loss.
In order to fit the result into the same number of bits as the operands, the answer
must be rounded or truncated.
Fractional bits lost below this value represent a precision loss which is common in
fractional multiplication.
If any integer bits are lost, however, the value will be radically inaccurate.
Some operations, like divide, often have built-in result limiting so that any positive
overflow results in the largest possible number that can be represented by the
current format.
Likewise, negative overflow results in the largest negative number represented by
the current format. This built in limiting is often referred to as saturation.
Some processors support a hardware overflow flag that can generate an exception on
the occurrence of an overflow, but it is usually too late to salvage the proper result at
this point.
2.2.1 Addition and Subtraction
In addition, the digits are added bit by bit from right to left, with carries passed to the
next digit to the left. Subtraction operation is also done using addition: The appropriate
operand is simply negated before being added.
Fig: a) Addition b) Subtraction
The MIPS instructions for addition and subtraction are given in the following table:
Booth’s Algorithm:
Booth algorithm gives a procedure for multiplying binary integers in signed 2's
complement representation. It operates on the fact that strings of 0's in the multiplier require no addition but
just shifting, and a string of 1's in the multiplier from bit weight 2^k to weight 2^m can be treated as 2^(k+1) –
2^m.
For example, the binary number 001110 (+14) has a string of 1's from 2^3 to 2^1 (k=3,
m=1). The number can be represented as 2^(k+1) – 2^m = 2^4 – 2^1 = 16 – 2 = 14. Therefore,
the multiplication M × 14, where M is the multiplicand and 14 the multiplier, can be done as
M × 2^4 – M × 2^1. Thus the product can be obtained by shifting the binary multiplicand M four
times to the left and subtracting M shifted left once.
Booth algorithm requires examination of the multiplier bits and shifting of the partial
product. Prior to the shifting, the multiplicand may be added to the partial product, subtracted
from the partial product, or left unchanged according to the following rules:
1. The multiplicand is subtracted from the partial product upon encountering the first
least significant 1 in a string of 1's in the multiplier.
2. The multiplicand is added to the partial product upon encountering the first 0 in a
string of 0's in the multiplier.
3. The partial product does not change when the multiplier bit is identical to the previous
multiplier bit.
The algorithm works for positive or negative multipliers in 2's complement
representation. This is because a negative multiplier ends with a string of 1's and the last
operation will be a subtraction of the appropriate weight. The two bits of the multiplier in Qn
and Qn+1 are inspected. If the two bits are equal to 10, it means that the first 1 in a string of
1's has been encountered. This requires a subtraction of the multiplicand from the partial
product in AC. If the two bits are equal to 01, it means that the first 0 in a string of 0's has
been encountered. This requires the addition of the multiplicand to the partial product in AC.
When the two bits are equal, the partial product does not change.
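A hedged C sketch of the algorithm for 8-bit operands is shown below (not from the original text). The variables ac, qr and q_1 mirror AC, Q and Qn+1 above; the code assumes the compiler implements >> on negative values as an arithmetic shift, which is true of common compilers.

    #include <stdio.h>
    #include <stdint.h>

    /* Booth multiplication: inspect the bit pair (Q0, Q-1), add or
       subtract the multiplicand into AC, then arithmetic-shift-right
       the combined AC:Q:Q-1 register group. */
    int16_t booth_multiply(int8_t m, int8_t q) {
        int16_t ac = 0;              /* accumulator AC */
        uint8_t qr = (uint8_t)q;     /* multiplier register Q */
        int q_1 = 0;                 /* extra bit Q-1 */
        for (int i = 0; i < 8; i++) {
            int q0 = qr & 1;
            if (q0 == 1 && q_1 == 0) ac -= m;  /* first 1 in a string of 1's */
            if (q0 == 0 && q_1 == 1) ac += m;  /* first 0 after a string of 1's */
            q_1 = q0;                          /* shift AC:Q:Q-1 right by one */
            qr = (uint8_t)((qr >> 1) | ((ac & 1) << 7));
            ac >>= 1;                          /* arithmetic shift keeps the sign */
        }
        return (int16_t)(((uint16_t)ac << 8) | qr);  /* AC:Q is the product */
    }

    int main(void) {
        printf("%d\n", booth_multiply(7, 14));   /* 98 */
        printf("%d\n", booth_multiply(-5, 13));  /* -65 */
        return 0;
    }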
Quotient = 0010, Remainder = 0001
With this fractional number system, we can represent the fractional numbers in the following
range,
The binary point is said to float, and the numbers are called floating point
numbers. The position of the binary point in floating point numbers is variable, and hence the
numbers must be represented in a specific manner, referred to as floating point
representation. The floating point representation has three fields. They are:
Sign: Sign bit is the first bit of the binary representation. '1' implies a negative number and '0'
implies a positive number.
Example: 11000001110100000000000000000001. This is a negative number
since it starts with 1.
Exponent: It starts from bit next to the sign bit of the binary representation. The
exponent field is needed to represent both positive and negative exponents. To do this,
a bias is added to the actual exponent in order to get the stored exponent. For IEEE
single-precision floats, this value is 127. Thus, to express an exponent of zero, 127 is
stored in the exponent field. A stored value of 200 indicates an exponent of (200 – 127),
or 73. The exponents of –127 (all 0s) and +128 (all 1s) are reserved for special
numbers.
Double precision has an 11-bit exponent field, with a bias of 1023. Example: for an
exponent field of 3 bits, bias = 2^(3-1) – 1 = 3.
Mantissa: Precision bits of the number. It is composed of an implicit leading bit (left of the radix
point) and the fraction bits (to the right of the radix point). To find out the value of the
implicit leading bit, consider that any number can be expressed in scientific notation
in many different ways.
Example: 50 can be represented as
1. 0.050 × 10^3
2. 0.500 × 10^2
3. 5.000 × 10^1
4. 50.00 × 10^0
5. 5000. × 10^-2
In order to maximize the quantity of representable numbers, floating-point numbers
are typically stored in normalized form. This basically puts the radix point after the
first non-zero digit. In normalized form, 50 is represented as 5.000 × 10^1.
Example 2.11. Find the decimal equivalent of the floating point number:
01000001110100000000000000000000
Sign=0
Exponent:
10000011_2 = 131_10
131 – 127 = 4
Scale factor = 2^4 = 16
Mantissa:
Remaining 23 bits: 10100000000000000000000
= 1×(1/2) + 0×(1/4) + 1×(1/8) + 0×(1/16) + … = 0.625
Decimal number = (-1)^Sign × Scale factor × (1 + Mantissa)
= +1 × 16 × 1.625 = 26
Example 2.11: Find the floating point equivalent of -17.
Sign=1 (-ve number)
Exponent:
Bias for 32 bits = 127 (2^(8-1) – 1 = 127); stored exponent = 127 + 4 = 131 = 10000011_2
Mantissa:
17 = 10001_2, so 17 = 1.0001 × 2^4
Fractional part = 00010000000000000000000
-17 = 1 10000011 00010000000000000000000_2
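A small C sketch (an illustration, not part of the original text) that decodes a single-precision pattern by extracting the three fields; the pattern 0x41D00000 is the value 26 worked out in Example 2.11:

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    int main(void) {
        uint32_t bits = 0x41D00000;   /* 0 10000011 10100000000000000000000 */
        int sign = (bits >> 31) & 1;
        int exponent = (int)((bits >> 23) & 0xFF) - 127;        /* remove the bias */
        double mantissa = 1.0 + (bits & 0x7FFFFF) / 8388608.0;  /* 1 + fraction/2^23 */
        double value = (sign ? -1.0 : 1.0) * ldexp(mantissa, exponent);
        printf("value = %g\n", value);   /* 26 */
        return 0;
    }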
Terminologies:
Overflow: A situation in which a positive exponent becomes too large to fit in the
exponent field.
Underflow: A situation in which a negative exponent becomes too large to fit in
the exponent field.
Double precision: A floating point value represented in two 32-bit words.
Single precision: A floating point value represented in a single 32-bit word.
Example 2.12: The IEEE-754 32-bit floating-point representation pattern is 0 10000000 110
0000 0000 0000 0000 0000. What is the number?
Sign bit S = 0 (positive number)
Exponent E = 10000000_2 = 128_10 (in normalized form)
Fraction is 1.11_2 (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75_10
The number is +1.75 × 2^(128-127) = +3.5_10
Example: Express 85.125 in IEEE 754 single precision and double precision formats.
85.125 = 1010101.001_2
= 1.010101001 × 2^6
Sign = 0
1. Single precision:
Biased exponent 127+6=133
133 = 10000101
Normalized mantissa = 010101001
The IEEE 754 Single precision = 0 10000101 01010100100000000000000
2. Double precision:
Biased exponent 1023+6=1029
1029 = 10000000101
Normalized mantissa = 010101001
The IEEE 754 Double precision=
0 10000000101 0101010010000000000000000000000000000000000000000000
The addition operation proceeds as follows: the exponent of one operand is subtracted from
the other using the small ALU to determine which is larger and by how much. This difference
controls the three multiplexors; from left to right, they select the larger exponent, the
significand of the smaller number, and the significand of the larger number. The smaller
significand is shifted right, and then the significands are added together using the big ALU.
The normalization step then shifts the sum left or right and increments or decrements the
exponent. Rounding then creates the final result, which may require normalizing again to
produce the final result.
-126 <= -4 <= 127 (-4 is within the range of -126 and 127).No overflow or underflow
Step 4: The sum fits in 4 bits so rounding is not required
Example 2.17: Express the following numbers in IEEE 754 format and find their sum:
2345.125 and 0.75. Single precision format of 2345.125:
The result is +ve hence 0 is filled in the sign field. The exponent value of 2345.125 is copied in
the exponent field of the result, since the 0.75 is adjusted to the exponent of 2345.125.
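Since the full bit patterns are not reproduced here, the following C sketch (an assumption-laden illustration, not the text's own solution) carries out the same align, add and normalize steps directly on the bit patterns; it handles only positive, normalized inputs and omits rounding, guard bits and special cases:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Toy FP adder for positive, normalized IEEE 754 single precision. */
    float fp_add(float fa, float fb) {
        uint32_t a, b;
        memcpy(&a, &fa, 4); memcpy(&b, &fb, 4);
        uint32_t ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
        uint64_t ma = (a & 0x7FFFFF) | 0x800000;   /* implicit leading 1 */
        uint64_t mb = (b & 0x7FFFFF) | 0x800000;
        if (ea < eb) {                 /* keep the larger exponent in ea */
            uint32_t te = ea; ea = eb; eb = te;
            uint64_t tm = ma; ma = mb; mb = tm;
        }
        uint32_t d = ea - eb;          /* align the smaller significand */
        mb = (d > 24) ? 0 : (mb >> d); /* real hardware keeps guard bits */
        uint64_t sum = ma + mb;        /* add the significands */
        if (sum & 0x1000000) {         /* normalize on carry-out */
            sum >>= 1; ea += 1;
        }
        uint32_t r = (ea << 23) | (uint32_t)(sum & 0x7FFFFF);
        float f; memcpy(&f, &r, 4);
        return f;
    }

    int main(void) {
        printf("%.3f\n", fp_add(2345.125f, 0.75f));   /* 2345.875 */
        return 0;
    }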
Example 2.19: Multiply 1.110 × 10^10 by 9.200 × 10^-5. Express the product to 3 decimal places.
1. Add the exponents:
Exponent of the product = 10 + (-5) = 5
2. Multiply the significant digits: 1.110 × 9.200 = 10.212000
3. Normalize the product:
10.212 × 10^5 = 1.0212 × 10^6
4. Round off:
1.0212 × 10^6 = 1.021 × 10^6
Example 2.21: Multiply -1.110 1000 0100 0000 10101 0001 × 2^-4 and 1.100 0000 0001 0000
0000 0000 × 2^-2.
1. Add the exponents:
Exponent of the product = -4 + (-2) = -6
2. Multiply the significant digits:
-1.110 1000 0100 0000 10101 0001 × 1.100 0000 0001 0000 0000 0000
= 10.1011100011111011111100110010100001000000000000
3. Normalize the product: 1.01011100011111011111100110010100001000000000000 × 2^-5
4. Round off (only 23 fraction bits):
1.01011100011111011111100 × 2^-5
MIPS supports floating point instructions for:
Arithmetic
Data movement (memory and registers)
Conditional jumps
Floating Point (FP) instructions work with a different bank of registers, named
$f0 to $f31. MIPS floating-point registers are used in pairs for double precision numbers and
are referred to using even numbers. Single precision instructions end with .s and double
precision instructions end with .d.
Load word coprocessor 1: lwc1 $f1, 100($s2) # $f1 = Memory[$s2 + 100]; loads 32-bit data to an FP register
Store word coprocessor 1: swc1 $f1, 100($s2) # Memory[$s2 + 100] = $f1; stores 32-bit data to memory
FP compare single (eq, ne, lt, le, gt, ge): c.lt.s $f2, $f4 # if ($f2 < $f4) cond = 1; else cond = 0
FP compare double (eq, ne, lt, le, gt, ge): c.lt.d $f2, $f4 # if ($f2 < $f4) cond = 1; else cond = 0
The analysis of high performance adders uses an extra quantity, namely the transit time.
The transmit time of a logical unit is used as a time base in comparing the operating
speeds of different methods, and the number of individual logical units required is
used in the comparison of costs.
The two multi-bit numbers being added together will be designated as A and B, with
individual bits being A1, A2, B1, etc. The third input will be C. Outputs will be S (sum), R
(carry), and T (transmit).
The time required to perform an addition in a conventional adder is dependent on the time
required for a carry originating in the first stage to ripple through all intervening stages
to the S or R output of the final stage. Using the transit time of a logical block as a unit of time,
this amounts to two levels to generate the carry in the first stage, plus two levels per stage for
transit through each intervening stage, plus two levels to form the sum in the final stage,
which gives a total of two times the number of stages.
Cn = Rn-1
Cn = Dn-1 + Tn-1·Rn-2
Cn = Dn-1 + Tn-1·Dn-2 + Tn-1·Tn-2·Rn-3
By allowing n to have successive values starting with one and omitting all terms
containing a resulting negative subscript, it may be seen that each stage of the adder will
require one OR stage with n inputs and n AND circuits having one through n inputs, where
n is the position number of the particular stage under consideration.
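The recurrence can be checked with a short C sketch (an interpretation of the notation above, not from the original text): D is taken as the generate function A AND B, T as the transmit function A OR B, and the carry into stage 0 is assumed to be zero, so the R terms reduce to D terms.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t a = 0xB5, b = 0x5B;   /* arbitrary example operands */
        uint8_t dgen = a & b;         /* D: generate */
        uint8_t t = a | b;            /* T: transmit */
        uint8_t c = 0;                /* bit n of c = carry INTO stage n */
        for (int n = 1; n < 8; n++) {
            int carry = 0;
            /* Cn = D(n-1) + T(n-1)D(n-2) + ... + T(n-1)...T(1)D(0) */
            for (int k = n - 1; k >= 0; k--) {
                int term = (dgen >> k) & 1;
                for (int j = k + 1; j < n; j++) term &= (t >> j) & 1;
                carry |= term;
            }
            c |= (uint8_t)(carry << n);
        }
        uint8_t sum = a ^ b ^ c;      /* Sn = An XOR Bn XOR Cn */
        printf("sum = 0x%02X, expected 0x%02X\n", sum, (uint8_t)(a + b));
        return 0;
    }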
The multiplier and the partial product will always be shifted the same amount and at
the same time.
The multiplier is shifted in relation to the decoder, and the partial product with
relation to the multiplicand.
Operation is assumed starting at the low-order end of the multiplier, which means
that shifting is to the right.
If the lowest-order bit of the multiplier is a one, it is treated as though it had been
approached by shifting across zeros.
Rules:
1. When shifting across zeros (from the low order end of the multiplier), stop at the first one.
a) If this one is followed immediately by a zero, add the multiplicand, then shift across
all following zeros.
b) If this one is followed immediately by a second one, subtract the multiplicand, then
shift across all following ones.
2. When shifting across ones (from the low order end of the multiplier), stop at the first zero.
a) If this zero is followed immediately by a one, subtract the multiplicand, then shift
across all following ones.
b) If this zero is followed immediately by a second zero, add the multiplicand, then
shift across all following zeros.
number of shifts and to recognize the completion of the multiplication.
If the high-order bit of the multiplier is a one and is approached by shifting across
ones, that shift will be to the first zero beyond the end of the multiplier, and that zero
along with the bit in the next higher order position of the register will be decoded to
determine whether to add or subtract.
For this reason, if the multiplier is initially located in the part of the register in which
the product is to be developed, it should be so placed that there will be at least two
blank positions between the locations of the low-order bit of the partial product and
the high-order bit of the multiplier.
Otherwise the low-order bit of the product will be decoded as part of the multiplier.
Multiplication Using Uniform Shifts
Multiplication which uses shifts of uniform size and permits predicting the number of
cycles that will be required from the size of the multiplier is preferable to a method
that requires varying sizes of shifts.
The most important use of this method is in the application of carry-save adders to
multiplication although it can also be used for other applications.
Uniform shifts of two
Assume that the multiplier is divided into two-bit groups, an extra zero being added to
the high-order end, if necessary, to produce an even number of bits.
Only one addition or subtraction will be made for each group, and, using the position of
the low-order bit in the group as a reference, this addition or subtraction will consist of
either two times or four times the multiplicand.
These multiples may be obtained by shifting the position of entry of the multiplicand
into the adder one or two positions left from the reference position.
The last cycle of the multiplication may require special handling.
Following any addition or subtraction, the resulting partial product will be either
correct or larger than it should be by an amount equal to one times the multiplicand.
Thus, if the high-order pair of bits of the multiplier is 00 or 10, the multiplicand would
be multiplied by zero or two and added, which gives a correct partial product.
If the high-order pair of bits is 01 or 11, the multiplicand is multiplied by two or four,
not one or three, and added. This gives a partial product that is larger than it should
be, and the next add cycle must correct for this.
Following the addition the partial product is shifted left two positions. This multiplies
it by four, which means that it is now larger than it should be by four times the
multiplicand.
This may be corrected during the next addition by subtracting the difference between
four and the desired multiplicand multiple.
Thus, if a pair ends in zero, the resulting partial product will be correct and the
following operation will be an addition.
If a pair ends in a one, the resulting partial product will be too large, and the following
operation will be a subtraction.
It can now be seen that the operation to be performed for any pair of bits of the
multiplier may be determined by examining that pair of bits plus the low-order bit of
the next higher-order pair.
If the bit of the higher-order pair is a zero, an addition will result; if it is one, a
subtraction will result. If the low-order bit of a pair is considered to have a value of one
and the high-order bit a value of two, then the multiple called for by a pair is the
numerical value of the pair if that value is even and one greater if it is odd.
If the operation is an addition, this multiple of the multiplicand is used. If the operation
is a subtraction (the low-order bit of the next higher order pair a one), this value is
combined with minus four to determine the correct multiple to use.
The result will be zero or negative, with a negative result meaning subtract instead of
add.
In a carry-save adder, the carry output of each stage is not connected
to the next-higher-order stage of the same adder, but goes to an intermediate register
or other device in the same manner as the sum (S) output.
A carry-save adder has three inputs which, as far as use is concerned, may be
considered identical, and two outputs which are not identical and must be treated in
different manners.
The procedure for adding several binary numbers by using a carry-save adder would be
as follows.
Designate the inputs for the nth bit as An, Bn, and Cn, and the outputs for the same bit as
Sn and Rn, where Sn is the sum output and Rn is the carry output.
In the first cycle enter three of the input numbers into A, B, and C.
In the second cycle enter the S and R obtained from the previous cycle into A and B and
the fourth input number into C.
In this operation Sn goes into An, but Rn goes into Bn+1, where Bn+1 is in the next higher-
order bit position than Bn.
This is continued until all of the input numbers have been entered into the adder.
Each add cycle advances all carries one position; add cycles as already described may be
continued, with zeros being entered into the third input each time, until the R outputs of
all stages become zero.
The alternative is to enter S and R into a carry-propagate adder and allow time for one
cycle through it.
This carry-propagate adder may be completely separate from the carry-save unit, or it
may be a combined unit with a control line for selecting either carry-save or carry-
propagate operation.
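A compact C sketch of this procedure (illustrative, not from the original text): each carry-save step reduces three inputs to a sum word S and a carry word R with no carry propagation, and R is shifted left one position when it is fed back, because Rn belongs to bit position n+1.

    #include <stdio.h>
    #include <stdint.h>

    /* One level of carry-save addition: three inputs in, S and R out. */
    void csa(uint32_t a, uint32_t b, uint32_t c, uint32_t *s, uint32_t *r) {
        *s = a ^ b ^ c;                    /* per-bit sum, no carries */
        *r = (a & b) | (a & c) | (b & c);  /* per-bit carry */
    }

    int main(void) {
        uint32_t in[4] = {13, 25, 7, 42};  /* numbers to add; total 87 */
        uint32_t s, r;
        csa(in[0], in[1], in[2], &s, &r);  /* enter the first three */
        csa(s, r << 1, in[3], &s, &r);     /* S and R back in, 4th into C */
        /* A final carry-propagate addition resolves remaining carries. */
        printf("%u\n", s + (r << 1));      /* 87 */
        return 0;
    }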
SUB WORD PARALLELISM
A sub word is a lower precision unit of data contained within a word. In sub word
parallelism, multiple sub words are packed into a word, and the processor then operates on
whole words.
With the appropriate sub word boundaries this technique results in parallel processing of sub
words. Since the same instruction is applied to all sub words within the word, this is a
form of SIMD (Single Instruction Multiple Data) processing. It is possible to apply sub word
parallelism to noncontiguous sub words of different sizes within a word; in practice,
implementation is simpler if the sub words are the same size and contiguous within a word.
The data parallel programs that benefit from sub word parallelism tend to process data that
are of the same size.
Example: If the word size is 64 bits, the sub word sizes can be 8, 16 and 32 bits. An
instruction can then operate on eight 8-bit sub words, four 16-bit sub words, two 32-bit sub
words or one 64-bit sub word in parallel.
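A software-only C sketch of the idea (not from the original text) for four 8-bit sub words packed into a 32-bit word; real sub word hardware simply cuts the carry chain at the sub word boundaries, which the masking below imitates:

    #include <stdio.h>
    #include <stdint.h>

    /* Add four packed 8-bit sub words in one pass, keeping a carry in
       one sub word from spilling into its neighbour. */
    uint32_t packed_add8(uint32_t x, uint32_t y) {
        uint32_t low7 = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F); /* bits 0-6 of each byte */
        return low7 ^ ((x ^ y) & 0x80808080);                /* bit 7, carry discarded */
    }

    int main(void) {
        printf("%08X\n", packed_add8(0x01020304, 0x10203040)); /* 11223344 */
        return 0;
    }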
5. Reduction operations that combine the packed sub words in a register into a single
value or a smaller set of values.
6. A way to clip higher precision numbers to fewer bits for storage or transmission.
7. The ability to move data between processor registers and memory, as well as the
ability to loop and branch to an arbitrary program location.
UNIT - III
THE PROCESSOR
3.1 INTRODUCTION
i. Instruction count: This depends on the compiler used and instruction set
architecture.
ii. Clock cycle time: This depends on processor implementation.
MIPS (Microprocessor without Interlocked Pipeline Stages) is a simple, streamlined, highly
scalable RISC architecture that has been widely adopted by industry.
Implementation of MIPS
MIPS has 32 General purpose registers (GPR) or integer registers (64 bit) holding integer
data. Floating point registers (FPR) are also available in MIPS capable of holding both single
precision (32 bit) and double precision data (64 bit). The following are the data types available
for MIPS:
i. Set the program counter (PC) to the address of the code and fetch the instruction
from that memory.
ii. Read one or two registers, using fields of the instruction to select the registers to
read. For the load word instruction, read only one register; for the store word instruction the
processor has to read two registers.
The ALU operations are done and the result of the operation is stored in the destination register;
for a store instruction the result is written to memory. When a branching operation is involved,
the next address to be fetched must be changed based on the branch target.
Fig 3.1: Implementation of MIPS architecture with multiplexers and control lines
Sequence of operations
Program Counter (PC): This register contains the address (location) of the instruction
currently getting executed. The PC is incremented to read the next instruction to be
executed.
The operands in the instruction are fetched from the registers.
The ALU or branching operations are done. The results of the ALU operations are stored
in registers. For load and store instructions, a memory address is computed; stores write
register values to that memory address, and loads transfer data from that address into registers.
In case of branch instructions, the result of the branch operation is used to determine the
next instruction to be executed.
Arithmetic/logical/shift/comparison
Control instructions (branch and jump)
Load/store
Other (exception, register movement to/from GP registers, etc.)
All the instructions are encoded in one of the following three formats:
I-type (immediate): Op code | Rs | Rt | Immediate
R-type (register to register operations): Op code | Rs | Rt | Rd | Shamt | Funct
J-type (jump): Op code | Offset
The instruction and data memories are separated in the MIPS implementation because:
The instruction formats for the operations are not unique; hence the memory access will
also be different.
Maintaining separate memory area is less expensive.
The operations of the processor are performed in a single cycle. A single memory (for both
instructions and data) will not allow two different accesses within one cycle.
Combinatorial elements:
The output depends only on the current input.
Faster operation speed and easy implementation.
For a given set of inputs, combinatorial elements give the same output, since there is no
storage of past data.
The basic building blocks are gates, which are time independent.
Sequential elements:
The output depends on the previous stage outputs.
Comparatively low operation speed and tougher implementation.
The outputs vary based on previous outputs.
The basic building blocks are flip flops, which are time dependent.
A clocking methodology is a set of rules for interconnecting components and clock signals
that, when followed, guarantee proper operation of the resulting system.
The primary objective of clocking methodology is timing correlation.
This allows the processor to read the register contents, send the value through some
combinatorial logic and write that register in the same clock cycle, under the assumption that
the state elements are controlled by an implicit clock.
Here, the stored values are updated only on a clock edge.
In combinatorial logic, the input must be read, processed and the output must be sent to
the location, all in one single clock cycle (Fig 3.2 a).
The driving force of this combinatorial circuit will be an explicit control signal.
All the changes occur only when the clock signal is triggered.
In edge-triggered methodology, the contents of a register are read and the value is sent
through combinational logic, and written to that register in the same clock cycle.
This prevents the access of inconsistent intermediate data
Feedback cannot occur within 1 clock cycle because of the edge-triggered update of the
state element.
The clock cycle still must be long enough so that the input values are stable when the
active clock edge occurs.
Data path is a functional unit that operates on or holds data. In the MIPS implementation the data
path elements includes instruction and data memories, the register file, the arithmetic logic unit
(ALU), and adders. The functionalities of basic elements are listed below:
Instruction Memory: It is a state element that provides only read access, because the data path does not perform write operations on it. This combinational memory always holds the contents of the location specified by the address.
Program Counter (PC): This is a 32 bit state register containing the address of the current instruction being executed. It is updated at every clock cycle and does not require an explicit write signal.
Adder: This is a combinational circuit that updates the value of the PC after every clock cycle to the address of the next instruction to be executed.
The fundamental operation in instruction fetch is to send the address in the PC to the instruction memory, obtain the specified instruction, and then increment the PC.
R type instructions:
They all read two registers, perform an ALU operation on the contents of the registers
and write the result.
This instruction class includes add, sub, and, or, and slt.
The processor's 32 general-purpose registers are stored in a structure called a register file.
A register file is a collection of registers in which any register can be read or written by
specifying the number of the register in the file. The register file contains the register
state of the machine.
The R-format always performs ALU operation that has three register operands (2-read
and 1-write).
The register number must be specified in order to read the data from the register file.
Also the output from a register file will contain the data that is read from the register.
The write operation to a register has two inputs: the register number and the value to be
written. This operation is edge triggered.
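A small Python sketch of such a register file, with two read ports and one write port as described above; the class and method names are illustrative, and the write models the edge-triggered update:

class RegisterFile:
    def __init__(self, n=32):
        self.regs = [0] * n                    # holds the register state of the machine

    def read(self, rs, rt):
        return self.regs[rs], self.regs[rt]    # two read ports

    def write(self, rd, value, reg_write):
        if reg_write and rd != 0:              # register 0 is hardwired to zero
            self.regs[rd] = value              # occurs on the clock edge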
Load and Store instructions:
The load and store instructions compute a memory address by adding the base register and an offset.
If the instruction is a load, the value read from memory must be written into the register
file in the specified register.
The memory address is computed by adding the contents of the base register and the 16-bit signed offset field (which is a part of the instruction).
If the instruction is a store, the value to be stored must also be read from the register.
Branch Instructions:
Branch Target is the address specified in a branch, which is used to update the PC if the
branch is taken. In the MIPS architecture the branch target is computed as the sum of the
offset field of the instruction and the address of the instruction following the branch.
The beq instruction (branch if equal) has three operands: two registers that are compared for equality, and a 16-bit offset used to compute the branch target address: beq $t1, $t2, offset
Thus, the branch data path must do two operations: compute the branch target address and compare the register contents for equality.
Branch taken: the branch condition is satisfied and the program counter (PC) loads the branch target. All unconditional branches are taken branches.
Branch not taken: the branch condition is false and the program counter (PC) loads the address of the instruction that sequentially follows the branch.
The branch target is calculated relative to the address of the next instruction after the branch, since the PC value will already have been updated to PC + 4 before the branch decision is taken.
The offset field is shifted left 2 bits to increase the effective range of the offset field by a factor of four.
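A hedged Python sketch of the branch-target arithmetic just described; pc holds the address of the branch instruction itself:

def branch_target(pc, offset16):
    # sign-extend the 16-bit offset field
    offset = offset16 - 0x10000 if offset16 & 0x8000 else offset16
    # the base is PC + 4 (the PC is incremented before the branch decision),
    # and the offset is shifted left 2 bits to extend its range by a factor of four
    return (pc + 4) + (offset << 2)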
When the condition is false, the execution looks like a normal branch.
When the condition is true, a delayed branch first executes the instruction immediately following the branch in sequential instruction order before jumping to the specified branch target address.
Delayed branches facilitate pipelining.
3.3.2 Creating a single Data path
A simple implementation of a single data path is to execute all operations within one
clock cycle.
Each data path resource can be used only once per clock cycle. To facilitate this, some resources must be duplicated for simultaneous access while other resources are shared.
One example is having separate memories for instructions and data.
When a resource is used in shared mode, multiple connections must be made to it. The selection of which input will access the resource is decided by a multiplexor.
To implement branch instructions the data path must include an adder circuitry to
compute branch target (Refer Fig: 3.6).
The control unit for this data path must take inputs and generate a write signal for
each state element. Apart from the inputs a selector control must be included for
each multiplexor and the ALU control.
The operations of arithmetic-logical (or R-type) instructions and the memory
instructions data path are almost similar.
The arithmetic-logical instructions use the ALU with the inputs coming from the two
registers. The memory instructions can also use the ALU to do the address
calculation, but the second input is the sign-extended 16-bit offset field from the
instruction.
3.4 SIMPLE IMPLEMENTATION SCHEME
The basic implementation includes a subset of the core MIPS instruction set:
The memory-reference instructions load word (lw) and store word (sw).
The arithmetic-logical instructions add, sub, AND, OR, and slt.
The instructions branch equal (beq) and jump (j).
For any instruction, the following two steps are the same:
Send the program counter (PC) to the memory that contains the code and fetch the
instruction from that memory.
Read one or two registers, using fields of the instruction to select the registers to read.
Load instruction needs to read only one register, but most other instructions require
reading two registers. The remaining actions required to complete the instruction
depend on the instruction class. For the three instruction classes namely memory-
reference, arithmetic-logical, and branches, the actions are mostly the same. This is due
to the simplicity and regularity of the MIPS instruction set.
MIPS instruction format for load (op code = 35 in decimal) and store (op code = 43 in decimal) instructions. The register rs is the base register that is added to the 16-bit address field to form the memory address. For loads, rt is the destination register for the loaded value. For stores, rt is the source register whose value should be stored into memory.
Instruction format for branch equal (op code = 4). The registers rs and rt are the source
registers that are compared for equality. The 16-bit address field is sign extended, shifted, and
added to the PC to compute the branch target address.
All instruction classes, except jump, use the arithmetic-logical unit (ALU) after reading
the registers.
The memory-reference instructions use the ALU for an address calculation, the
arithmetic-logical instructions for the operation execution, and branches for comparison.
After using the ALU, the actions required to complete various instruction classes differ.
A memory-reference instruction will need to access the memory either to read data for a
load or write data for a store.
An arithmetic-logical or load instruction must write the data from the ALU or memory
back into a register.
Branch instructions need to change the next instruction address based on the
comparison; otherwise, the PC should be incremented by 4 to get the address of the next
instruction.
All instructions start by using the program counter to supply the instruction address to
the instruction memory.
After the instruction is fetched, the register operands used by an instruction are
specified by fields of that instruction.
Once the register operands have been fetched, they can be operated on to compute a
memory address (for a load or store), to compute an arithmetic result (for an integer
arithmetic-logical instruction), or a compare (for a branch).
If the instruction is an arithmetic-logical instruction, the result from the ALU must be
written to a register.
If the operation is a load or store, the ALU result is used as an address to either store a
value from the registers or load a value from memory into the registers.
The result from the ALU or memory is written back into the register file.
Branches require the use of the ALU output to determine the next instruction address,
which comes either from the ALU (where the PC and branch off set are summed) or from
an adder that increments the current PC by 4.
The thick lines interconnecting the functional units represent buses, which consist of
multiple signals.
Fig. 3.12 shows the data path of Fig. 3.8 with the three required multiplexors added, and control lines for the major functional units.
A control unit, which has the instruction as an input, is used to determine how to set the
control lines for the functional units and two of the multiplexors.
The third multiplexor, which determines whether PC + 4 or the branch destination
address is written into the PC, is set based on the Zero output of the ALU, which is used
to perform the comparison of a beq instruction.
Four steps to execute the instruction; these steps are ordered by the flow of information:
The instruction is fetched, and the PC is incremented.
Two registers, $t2 and $t3, are read from the register file. The main control unit computes the setting of the control lines during this step.
The ALU operates on the data read from the register file, using the function code (bits
5:0, which is the funct field, of the instruction) to generate the ALU function.
The result from the ALU is written into the register file using bits 15:11 of the
instruction to select the destination register ($t1).
ALUSrc: when deasserted, the second ALU operand comes from the second register file output (Read data 2); when asserted, the second ALU operand is the sign-extended lower 16 bits of the instruction.
The setting of the control lines is completely determined by the op code fields of the instruction
as given below:
Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
sw          |   x    |   1    |    x     |    0     |    0    |    1     |   0    |   0    |   0
beq         |   x    |   0    |    x     |    0     |    0    |    0     |   1    |   0    |   1
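The same table rendered as a Python dictionary, which can be convenient for simulation; the tuple layout and the None-for-don't-care encoding are illustrative choices, not part of the hardware:

CONTROL = {
    #            RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp
    "R-format": (1,     0,     0,       1,       0,      0,       0,     0b10),
    "lw":       (0,     1,     1,       1,       1,      0,       0,     0b00),
    "sw":       (None,  1,     None,    0,       0,      1,       0,     0b00),
    "beq":      (None,  0,     None,    0,       0,      0,       1,     0b01),
}   # None marks a don't-care (x) value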
The truth table for the main control unit as a function of the op code bits (columns correspond to R-format, lw, sw and beq):
Inputs
Op5: 0 1 1 0
Op4: 0 0 0 0
Op3: 0 0 1 0
Op2: 0 0 0 1
Op1: 0 1 1 0
Op0: 0 1 1 0
Outputs
RegDst: 1 0 x x
ALUSrc: 0 1 1 0
MemtoReg: 0 1 x x
RegWrite: 1 1 0 0
MemRead: 0 1 0 0
MemWrite: 0 0 1 0
Branch: 0 0 0 1
ALUOp1: 1 0 0 0
ALUOp0: 0 0 0 1
3.5 PIPELINING
Pipelining is an implementation technique in which multiple instructions are executed simultaneously by overlapping them in execution, saving time and resources. The previous instruction will be in its execution phase when the current instruction is fetched from memory.
Without a pipeline, a computer processor fetches the first instruction from memory,
performs the operation mentioned in it, and then goes to fetch the next instruction from
memory. While fetching the instruction, the arithmetic unit of the processor is idle. It must wait
until it is loaded with next instruction.
With pipelining, the computer architecture allows the next instructions to be fetched
while the processor is performing arithmetic operations, holding them in a buffer close to the
processor. The result is an increase in the number of instructions that can be performed during a
given time period.
3.5.1 Stages in MIPS pipelining
The classic MIPS pipeline has five stages: IF (instruction fetch), ID (instruction decode and register file read), EX (execute or address calculation), MEM (data memory access), and WB (write back). Pipelining does not decrease the execution time of an individual instruction, but it increases the number of instructions that complete their execution in a given time period. Thus the overall performance of the processor is improved both in terms of resource utilization and throughput.
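A quick Python comparison of total execution time with and without an ideal five-stage pipeline; the 200 ns stage time and the assumption that every stage takes equally long are illustrative:

def unpipelined_time(n, stages=5, stage_ns=200):
    return n * stages * stage_ns        # each instruction runs start to finish

def pipelined_time(n, stages=5, stage_ns=200):
    return (stages + n - 1) * stage_ns  # fill the pipe, then one instruction per cycle

print(unpipelined_time(1000))  # 1000000 ns
print(pipelined_time(1000))    # 200800 ns: nearly a 5x throughput gain for large n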
Fig 3.14 shows the comparison of execution of instructions with and without pipelining
on same hardware components. The timeline clearly indicates that there is a difference in
execution time and resource utilization. The challenges in implementing pipelining arise from the slowest resource, since the slowest pipeline stage determines the clock cycle.
3.5.2 Designing instruction sets for Pipelining
The simplicity and generality of MIPS instructions lie in their fixed length: all instructions are the same size. This facilitates easy instruction fetching in the first stage of pipelining.
MIPS has only a few instruction formats. In every instruction format, the source operand
register is located at the same position in the instruction format.
This symmetry eases the instruction decode stage by reading the register file
simultaneously while the hardware is determining the type of instruction format.
Also, memory operands appear only in load or store instructions in MIPS, so the execute stage can calculate the memory address and then access memory in the following stage.
Operands must be aligned in memory. Hence, a single data transfer instruction can never require two data memory accesses, and the transfer can be done in a single pipeline stage.
3.5.3 Hazards in Pipelining
Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance of pipelining.
A hazard arises when two or more instructions attempt to use the same resource at the same time.
Example: if a single memory is used for both instruction and data access, one instruction in the instruction fetch stage and another in the memory access stage contend for the memory in the same cycle.
Types of hazard:
Structural Hazards: They arise from resource conflicts when the hardware cannot support all
possible combinations of instructions in simultaneous overlapped execution.
Data Hazards: They arise when an instruction depends on the result of a previous
instruction in a way that is exposed by the overlapping of instructions in the pipeline.
Control Hazards: They arise from the pipelining of branches and other instructions that
change the PC. This is also known as branch hazard. The flow of instruction addresses is
not what the pipeline had expected. This results in control hazard.
Data hazards occur when the pipeline must be stalled because one step must wait for
another to complete.
Data hazards occur in register files due to inconsistencies in file. This is an occurrence in
which a planned instruction cannot execute in the proper clock cycle because data that is needed
to execute the instruction is not yet available. In other words, data hazards occur when the
pipeline must be stalled because one step must wait for another to complete. This is due to the
data dependence.
Example: Consider the following instructions:
add $s0, $t0, $t1
sub $t2, $s0, $t3
Here the sub instruction uses the result of the add instruction ($s0). The add instruction cannot write its result until the fifth stage. This results in wasting three clock cycles in the pipeline. Since the stall occurs due to the non-availability of data, this is termed a data hazard.
The stall mentioned in Fig 3.16 is called bubble or pipeline stall. A pipeline stall is a
delay in execution of an instruction in order to resolve a hazard. During the decoding stage, the
control unit will determine if the decoded instruction reads from a register that the instruction
currently in the execution stage writes to.
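In essence the control unit evaluates the classic load-use condition; a hedged Python sketch, using the pipeline-register field notation introduced later in this chapter (dictionary keys are illustrative):

def must_stall(id_ex, if_id):
    # stall when the instruction in EX is a load (MemRead asserted) whose
    # destination register (Rt) is a source (Rs or Rt) of the instruction
    # currently being decoded
    return id_ex["MemRead"] and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"])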
Problem 3.1
Find the hazards in the following code segment and reorder the instructions to avoid any
pipeline stalls.
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1,$t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1,$t4
sw $t5, 16($t0)
Solution:
Both the add instructions have a hazard because of their dependence on the immediately
preceding lw instruction. Bypassing eliminates several other potential hazards including the
dependence of the first add on the first lw and any hazards for store instructions. Moving up the
third lw instruction eliminates both hazards. This is possible since the lw instruction is
independent of other operations:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1,$t2
sw $t3, 12($t0)
add $t5, $t1,$t4
sw $t5, 16($t0)
Instruction decode and register file read: the instruction is decoded, the two source registers are read, and the 16-bit immediate is sign extended. All three values are stored in the ID/EX pipeline register, along with the incremented PC address.
Transfer everything that might be needed by any instruction during a later clock cycle.
These first two stages are executed by all instructions, since it is too early to know the type of the instruction.
Execute or address calculation:
The load instruction reads the contents of register 1 and the sign-extended immediate
from the ID/EX pipeline register and adds them using the ALU.
That sum is placed in the EX/MEM pipeline register.
Memory access:
The load instruction reads the data memory using the address from the EX/MEM pipeline register and loads the data into the MEM/WB pipeline register.
The register containing the data to be stored was read in an earlier stage and stored in
ID/EX.
The only way to make the data available during the MEM stage is to place the data into
the EX/MEM pipeline register in the EX stage, just as we stored the effective address into
EX/MEM.
Write back:
This involves reading the data from the MEM/WB pipeline register and writing it into the
register file.
This section describes the necessary control lines for implementing a pipelined data
path. The control logic is needed for PC source, register destination number, and ALU control. A
6-bit funct field (function code) is needed for the instruction in the EX stage as input to ALU
control, so these bits must also be included in the ID/EX pipeline register. These 6 bits are the 6
least significant bits of the immediate field in the instruction, so the ID/EX pipeline register can
supply them from the immediate field since sign extension leaves these bits unchanged.
There are no separate write signals for the pipeline registers (IF/ID, ID/EX, EX/MEM, and MEM/WB), since the pipeline registers are written during each clock cycle.
To specify control for the pipeline, it is enough to set the control values during each pipeline stage, because each control line is associated with a component active in only a single pipeline stage.
The control lines are also divided into five groups according to the pipeline stage:
Instruction fetch: The control signals to read instruction memory and to write the PC are
always asserted, so there is nothing special to control in this pipeline stage.
Instruction decode/register file read: As in the previous stage, the same thing happens at
every clock cycle, so there are no optional control lines to set.
Execution/address calculation: The signals to be set are Reg Dst, ALU Op, and ALU Src.
The signals select the Result register, the ALU operation, and either Read data 2 or a sign-
extended immediate for the ALU.
Memory access: The control lines set in this stage are Branch, Mem Read, and Mem Write.
These signals are set by the branch equal, load, and store instructions, respectively.
Write back: The two control lines are Mem to Reg, which decides between sending the ALU
result or the memory value to the register file, and Reg Write, which writes the chosen value.
Implementing control means setting the nine control lines to these values in each stage
for each instruction (explained in simple implementation scheme). The simplest way to do this is
to extend the pipeline registers to include control information.
3.8 DATA HAZARDS
As defined earlier, data hazards occur when the pipeline must be stalled because one step must wait for another to complete: a planned instruction cannot execute in its proper clock cycle because the data needed to execute it is not yet available. The underlying cause is data dependence between instructions.
3.8.1 Forwarding or Bypassing
Forwarding cannot be a universal solution to solve data hazards. Consider the following
instructions:
lw $s0, 20($t1)
sub $t2, $s0, $t3
The desired data would be available only after the fourth stage of the first instruction in the
dependence, which is too late for the input of the third stage of sub. Hence, even with
forwarding, there will be a hazard called as load-use data hazard.
A specific form of data hazard in which the data requested by a load instruction has not yet
become available when it is requested. This is Load-use data hazard.
Consider the following code:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
There are several dependences in this code fragment:
The first instruction, SUB, stores a value into $2.
That register is used as a source in the rest of the instructions. This is no problem for the single-cycle and multi-cycle data paths:
Each instruction executes completely before the next begins.
This ensures that instructions 2 through 5 above use the new value of $2.
The SUB does not write to register $2 until clock cycle 5, causing two data hazards in our pipelined data path:
The AND reads register $2 in cycle 3. Since SUB hasn't modified the register yet, this is the old value of $2.
The OR instruction uses register $2 in cycle 4, again before it's actually updated by SUB.
To avoid the data hazard without hardware support, the instructions can be rewritten with stalls inserted until $2 has been written back (nop denotes a no-operation):
sub $2, $1, $3
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Since it takes two instruction cycles to get the value stored, one solution is for the assembler to
insert no-ops or for compilers to reorder instructions to do useful work while the pipeline
proceeds. Since the pipeline registers already contain the ALU result, we could just forward the
value to later instructions, to prevent data hazards
In clock cycle 4, the AND instruction can get the value of $1 - $3 from the EX/MEM
pipeline register used by SUB.
Then in cycle 5, the OR can get that same result from the MEM/WB pipeline register
being used by SUB.
Forward the data as soon as it is available to any units that need it before it is available to
read from the register file. This is forwarding in data hazards.
When an instruction tries to use a register in its EX stage that an earlier instruction intends to write in its WB stage, the values are actually needed as inputs to the ALU. The general format for specifying dependences is given by:
PipelineRegister.FieldInTheRegister
Example: ID/EX.RegisterRs refers to the value found in the pipeline register ID/EX, in the field RegisterRs. The dependences in the given example are:
EX/MEM.RegisterRd = ID/EX.RegisterRs
EX/MEM.RegisterRd = ID/EX.RegisterRt
MEM/WB.RegisterRd = ID/EX.RegisterRs
MEM/WB.RegisterRd = ID/EX.RegisterRt
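A hedged Python sketch of how a forwarding unit would evaluate these conditions each cycle for the ALU's first operand; the RegWrite and register-zero guards mirror the full hardware condition, and the dictionary field names are illustrative:

def forward_a(ex_mem, mem_wb, id_ex):
    # decide which value should feed the ALU's first input
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex["Rs"]:
        return "EX/MEM"   # forward the ALU result computed last cycle
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex["Rs"]:
        return "MEM/WB"   # forward the value about to be written back
    return "ID/EX"        # no hazard: use the value read from the register file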
The first hazard in the sequence is on register $2, between the result of sub $2,$1,$3 and the first
read operand of and $12,$2,$5. This hazard can be detected when the AND instruction is in the
EX stage and the prior instruction is in the MEM stage.
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2
Forwarding the inputs to the ALU from the pipeline registers is done by adding multiplexors to the inputs of the ALU, with the proper controls. With this, the pipeline can be executed at full speed in the presence of these data dependences.
3.8.2 Stalling
A bubble is inserted beginning in clock cycle 4, by changing the AND instruction to a nop (no
operation). Note that the and instruction is really fetched and decoded in clock cycles 2 and 3,
but its EX stage is delayed until clock cycle 5. The or instruction is fetched in clock cycle 3, but its
IF stage is delayed until clock cycle 5. After insertion of the bubble, all the dependences go
forward in time and no further hazards occur.
3.9 CONTROL HAZARDS
A control hazard occurs when an instruction needs to take a decision based on the result of another instruction that has not yet completed its execution.
Control or branch hazards arise from the pipelining of branches and other instructions that change the PC: the hardware cannot know which instruction to fetch next until the branch outcome is decided.
Instructions that disrupt the sequential flow of control present problems for pipelines and are potential candidates for control hazards. The effects of these instructions cannot be exactly determined until late in the pipeline, so instruction fetch cannot continue unless it is explicitly managed. The following types of instructions can introduce control hazards:
Unconditional branches
Conditional branches
Indirect branches
Procedure calls
Procedure returns
Example (the original listing is incomplete; the lines after the first are a reconstructed sketch, and the register numbers are illustrative):
ldi r1, 1 // r1 := 1 (assume the two values are equal)
lw r2, (r4) // load the first memory value
lw r3, (r5) // load the second memory value
sub r6, r2, r3 // r6 := r2 - r3
beqz r6, store // branch if the two values were equal (r6 == 0)
ldi r1, 0 // fall-through: values differ, r1 := 0
store: st r1, (r7) // store the comparison result
This code compares two memory locations and stores the result of that comparison (1
for equal, 0 for not equal) to another location. If the beqz branch is taken, then a 1 is stored;
otherwise, a 0 is stored. The beqz instruction sources two hazards:
When the beqz instruction is in the decode stage, the sub instruction is in the execute
stage. The branch cannot read the output of the sub until it has been written to the
register file; if it reads it early, it will read the wrong value.
The instruction to be fetched after beqz is not known in advance. At this point, the status of the branch instruction is totally unknown, including whether it depends on the previous instruction or not. This is because it hasn't been decoded yet, so bypassing also can't help in resolving the hazard. Even if the decision were known, the location from which to fetch the instruction if the branch is taken is unknown, because the effective address computation for branches does not happen until the EX stage.
Branch prediction: The outcome and target of conditional branches are predicted using
some heuristic. Instructions are speculatively fetched and executed down the predicted
path, but results are not written back to the register file until the branch is executed and
the prediction is verified. When a branch is predicted, the processor enters a speculative
mode in which results are written to another register file that mirrors the architected
register file. Another pipeline stage called the commit stage is introduced to handle
writing verified speculatively obtained results back into the real register file. Branch
predictors can't be 100% accurate, so there is still a penalty for branches that is based on the branch misprediction rate.
Return address stack (RAS): Procedure returns are a form of indirect jump that can be
perfectly predicted with a stack as long as the call depth doesn't exceed the stack depth.
Return addresses are pushed onto the stack at a call and popped off at a return.
Branch prediction is a method of resolving a branch hazard that assumes a given outcome
for the branch and proceeds from that assumption rather than waiting for the actual
outcome.
In general, the bottoms of loops are branches that jump back to the top of the loop. These
types of loops can easily be predicted as branch taken.
The decision about whether a branch is taken or not taken is derived from heuristics.
Dynamic hardware predictors, guess the behavior of each branch and may change
predictions for a branch over the life of a program.
Dynamic prediction is performed by maintaining a history for each branch as taken or
untaken, and then using the recent past behavior to predict the future.
When the guess is wrong, the pipeline control must ensure that the instructions
following the wrongly guessed branch have no effect and must restart the pipeline from
the proper branch address.
Branch Stalling
Stalling the instructions until the branch is complete is too slow.
One improvement over branch stalling is to predict that the branch will not be taken and thus continue execution down the sequential instruction stream.
If the branch is taken, the instructions that are being fetched and decoded must be
discarded. Execution continues at the branch target.
If branches are untaken half the time, and if it costs little to discard the instructions, this
optimization halves the cost of control hazards.
To discard instructions, change the original control values to 0s.
Delayed Branches:
The delayed branch always executes the next sequential instruction, with the branch taking
place after that one instruction delay. It is hidden from the MIPS assembly language
programmer because the assembler can automatically arrange the instructions to get the
branch behavior desired by the programmer.
One way to improve branch performance is to reduce the cost of the taken branch.
The MIPS architecture was designed to support fast single-cycle branches that could be
pipelined with a small branch penalty.
Assume the predict bit = 0 to start (indicating branch not taken) and that the loop control is at the bottom of the code.
On the first iteration of the loop, the predictor mispredicts the branch, since the branch is taken back to the top of the loop; the prediction bit is then inverted (predict bit = 1).
On exiting the loop, the predictor again mispredicts the branch, since this time the branch is not taken, falling out of the loop; the prediction bit is inverted again (predict bit = 0).
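A minimal Python simulation of this 1-bit predictor; as the text notes, a loop that is taken on every iteration except the exit mispredicts exactly twice:

def simulate_1bit(outcomes, predict=0):
    # outcomes: 1 = branch taken, 0 = not taken; predict: the 1-bit history
    mispredicts = 0
    for taken in outcomes:
        if taken != predict:
            mispredicts += 1
        predict = taken          # the bit always records the latest outcome
    return mispredicts

# a loop executed 4 times: taken, taken, taken, then falls out
print(simulate_1bit([1, 1, 1, 0]))  # 2 mispredictions: the first and the last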
Branch delay slot: the slot directly after a delayed branch instruction, which in the MIPS architecture is filled by an instruction that does not affect the branch.
The limitations on delayed branch scheduling arise from the restrictions on the instructions that can be scheduled into the delay slot and from the limited ability to predict at compile time whether a branch is likely to be taken or not.
Delayed branching was a simple and effective solution for a five-stage pipeline issuing
one instruction each clock cycle.
As processors go to both longer pipelines and issuing multiple instructions per clock
cycle, the branch delay becomes longer, and a single delay slot is insufficient.
Hence, delayed branching has lost popularity compared to more expensive but more
flexible dynamic approaches.
3.10 EXCEPTIONS
Control is the most challenging aspect of processor design: One of the hardest parts of
control is implementing exceptions and interrupts events other than branches or jumps that
change the normal flow of instruction execution. They were initially created to handle
unexpected events from within the processor, like arithmetic overflow. The term exception refer
to any unexpected change in control flow without distinguishing whether the cause is internal or
external. Interrupt is when the event is externally caused. The following are the causes of
exceptions:
R-type arithmetic overflow
Executing undefined instruction
I/O device request
OS service request
Hardware malfunction
Detecting exceptional conditions and taking the appropriate action is often on the critical
timing path of a processor, which determines the clock cycle time and performance.
Exception Handling in the MIPS Architecture:
The two types of exceptions that MIPS implementation can generate are execution of an
undefined instruction and an arithmetic overflow.
When an exception occurs, the pipeline must flush the instructions following the offending instruction and begin fetching from the new address. This is done by turning the IF stage into a nop. Because of careful planning, the overflow exception is detected during the EX stage; hence, the EX.Flush signal can be used to prevent the instruction in the EX stage from writing its result in the WB stage. The final step is to save the address of the offending instruction in the exception program counter (EPC). In reality, the address + 4 is saved, so the exception-handling software routine must first subtract 4 from the saved value.
3.11 INSTRUCTION LEVEL PARALLELISM AND MULTIPLE ISSUE
Instruction-level parallelism (ILP) is increased by deepening the pipeline to overlap more instructions, and by adding extra hardware resources that replicate the internal components of the computer so that it can launch multiple instructions in every pipeline stage. The latter is called multiple issue.
In Multiple Issue technique, multiple instructions are launched in one clock cycle.
This will improve the performance of the processor. The pipelined performance is estimated
from the given formula (CPI-Cycles Per Instruction):
Pipeline CPI = Ideal CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
Launching multiple instructions per stage allows the instruction execution rate (CPI) to
be less than 1. To obtain substantial increase in performance, we need to exploit parallelism
across multiple basic blocks.
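The CPI accounting above, rendered in Python with illustrative stall contributions:

def pipeline_cpi(ideal=1.0, structural=0.05, raw=0.20, war=0.0, waw=0.0, control=0.15):
    # each argument is the average stall cycles per instruction from that hazard class
    return ideal + structural + raw + war + waw + control

print(pipeline_cpi())  # about 1.40 cycles per instruction with these assumed stalls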
Implementing multiple issue processor
Static multiple issue processor: Here the decisions are made by the compiler before
execution.
Dynamic multiple issue processor: Here the decisions are made during the execution
by the processor.
Dealing with data and control hazards: In static issue processors, the consequences of data and control hazards are handled statically by the compiler. In dynamic issue processors, hardware techniques are used at execution time to mitigate control and data hazards.
3.11.1 Speculation
This allows the execution of complete instructions or parts of instructions before being
certain whether this execution should take place.
Issue in Speculation:
Speculating on certain instructions may introduce exceptions that were formerly not
present. The result would be that an exception that should not have occurred will occur. In
compiler-based speculation, such problems are avoided by adding special speculation support
that allows such exceptions to be ignored until it is clear that they really should occur. In
hardware-based speculation, exceptions are simply buffered until it is clear that the instruction
causing them is no longer speculative and is ready to complete; at that point the exception is
raised, and normal exception handling proceeds.
An issue packet is the set of instructions that issues together in one clock cycle; the packet may be determined statically by the compiler or dynamically by the processor.
Static multiple-issue processors use the compiler to assist with packaging instructions and handling hazards. The issue packet is treated as one large instruction with multiple operations. This is otherwise termed a Very Long Instruction Word (VLIW). Since the Intel IA-64 architecture supports this approach, it is known as Explicitly Parallel Instruction Computing (EPIC).
Loop unrolling is a technique used by the compiler to support static multiple issue.
Loop unrolling is a technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.
Loop unrolling is a compiler optimization applied to certain kinds of loops to reduce the
frequency of branches and loop maintenance instructions. It is easily applied to sequential array
processing loops where the number of iterations is known prior to execution of the loop. After
unrolling, there is more ILP available by overlapping instructions from different iterations.
During the unrolling process, the compiler introduces additional registers, since multiple copies of the loop body are made.
Introducing new registers during loop unrolling is called register renaming (a sketch follows this list). This is done to eliminate dependences that are not true data dependences but may lead to potential hazards or may prevent the compiler from scheduling the code.
To identify the independent instructions, it is necessary to trace the data dependences.
If no data values flow between two instructions, the dependence is termed an anti-dependence or name dependence: an ordering forced purely by the reuse of a name.
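A Python sketch of the idea (the transformation itself is done by the compiler on machine code; this source-level version is only illustrative). The four accumulators play the role of the renamed registers, removing the name dependence a single accumulator would create:

def sum_unrolled(a):
    # assumes len(a) is a multiple of 4, for brevity
    s0 = s1 = s2 = s3 = 0              # four independent accumulators
    for i in range(0, len(a), 4):      # loop overhead paid once per 4 elements
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
    return s0 + s1 + s2 + s3           # combine the partial sums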
Renaming the registers during the unrolling process allows the compiler to schedule the independent instructions into a better code schedule.
An instruction group is a sequence of consecutive instructions with no register data
dependences among them.
All the instructions in a group could be executed in parallel if sufficient hardware
resources existed and if any dependences through memory were preserved.
The compiler must explicitly indicate the boundary between one instruction group and
another. This boundary is indicated by placing a stop between two instructions that
belong to different groups.
An explicit indicator of a break between independent and dependent instructions is
termed as stop.
Predication is a technique that can be used to eliminate branches by making the
execution of an instruction dependent on a predicate, rather than dependent on a
branch.
Speculation and Predication improves ILP. Branches reduce the opportunity to exploit
ILP by restricting the movement of code.
Branches within a loop cannot be eliminated by loop unrolling. Predication eliminates
this branch, by allowing more flexible exploitation of parallelism.
Speculation consists of separate support for control speculation, which deals with
deferring exceptions for speculated instructions, and memory reference speculation,
which supports speculation of load instructions.
Deferred exception handling is supported by adding speculative load instructions, which,
when an exception occurs, tag the result as poison.
Poison is the result generated when a speculative load yields an exception, or an
instruction uses a poisoned operand. When a poisoned result is used by an instruction,
the result is also poison; the software can then check for a poisoned result when it knows that the execution is no longer speculative.
The speculation on memory references can be made by moving loads earlier than stores
on which they may depend. This is done with an advanced load instruction.
Advanced load is speculative load instruction with support to check for aliases that could
invalidate the load. This demands the use of a special table to track the address that the
processor loaded from.
A subsequent instruction must be used to check the status of the entry after the load is
no longer speculative.
Dynamic multiple issue processors are implemented using superscalar processors that
are capable of executing more than one instruction per clock cycle.
The compiler assists by scheduling the instructions so that consecutively issued instructions are free of dependences.
Instruction Fetch Unit: This unit fetches instructions, decodes them, and sends each
instruction to a corresponding functional unit for execution.
Functional unit: They have buffers, called reservation stations that hold the operands
and the operation. As soon as the buffer contains all its operands and the functional unit
is ready to execute, the result is calculated. When the result is completed, it is sent to any
reservation stations waiting for this particular result as well as to the commit unit.
Commit Unit: This buffers the result until it is safe to put the result into the register file
or, for a store, into memory. The buffer in the commit unit, called the reorder buffer, is
also used to supply operands, in much the same way as forwarding logic does in a
statically scheduled pipeline. Once a result is committed to the register file, it can be
fetched directly from there, just as in a normal pipeline.
Operation of dynamic scheduling pipeline:
When an instruction issues, if either of its operands is in the register file or the reorder
buffer, it is copied to the reservation station immediately, where it is buffered until all
the operands and an execution unit are available. For the issuing instruction, the register
copy of the operand is no longer required, and if a write to that register occurred, the
value could be overwritten.
If an operand is not in the register file or reorder buffer, it must be waiting to be
produced by a functional unit. The name of the functional unit that will produce the
result is tracked. When that unit eventually produces the result, it is copied directly into
the waiting reservation station from the functional unit bypassing the registers.
Dynamic scheduling is often extended by including hardware-based speculation, especially
for branch outcomes. By predicting the direction of a branch, a dynamically scheduled processor
can continue to fetch and execute instructions along the predicted path.
UNIT - IV
MEMORYAND I/O ORGANIZATION
4.1 INTRODUCTION
The memory unit enables us to store data inside the computer. Computer memory always adheres to the principle of locality.
Principle of locality or locality of reference is the tendency of a processor to access the same
set of memory locations repetitively over a short period of time.
Two different types of locality are:
Temporal locality: The principle stating that if a data location is referenced then it will
tend to be referenced again soon.
Spatial locality: The locality principle stating that if a data location is referenced, data
locations with nearby addresses will tend to be referenced soon.
The locality of reference is useful in implementing the memory hierarchy.
Memory hierarchy is a structure that uses multiple levels of memories; as the distance from
the CPU increases, the size of the memories and the access time both increase.
A memory hierarchy consists of multiple levels of memory with different speeds and sizes. The
faster memories are more expensive per bit than the slower memories and thus smaller.
Cache memory (CPU memory) is high-speed SRAM that a computer Microprocessor can
access more quickly than it can access regular RAM. This memory is typically integrated
directly into the CPU chip or placed on a separate chip that has a separate bus interconnect
with the CPU.
If the data requested by the processor is found in the upper level, this is called a hit. If the data is not found in the upper level, the request is called a miss. The lower level in the hierarchy is then accessed to retrieve the block containing the requested data.
The fraction of memory accesses found in a cache is termed as hit rate or hit ratio.
Miss rate is the fraction of memory accesses not found in a level of the memory hierarchy. Hit
time is the time required to access a level of the memory hierarchy, including the time needed to
determine whether the access is a hit or a miss.
Miss penalty is the time required to fetch a block into a level of the memory hierarchy
from the lower level, including the time to access the block, transmit it from one level to
the other, and insert it in the level that experienced the miss.
Because the upper level is smaller and built using faster memory parts, the hit time will
be much smaller than the time to access the next level in the hierarchy, which is the major
component of the miss penalty.
Miss: If the requested data is not found in the upper levels of memory hierarchy it is called
miss.
Hit rate or hit ratio: It is the fraction of memory accesses found in the upper level. It is a performance metric.
Hit Ratio = Hit/ (Hit + Miss)
Miss rate: It is the fraction of memory access not found in the upper level (1-hit rate).
Hit Time: The time required for accessing a level of memory hierarchy, including the time
needed for finding whether the memory access is a hit or miss.
Miss penalty: The time required for fetching a block into a level of the memory hierarchy
from the lower level, including the time to access, transmit, insert it to new level and pass
the block to the requestor.
Bandwidth: The data transfer rate by the memory.
Latency or access time: Memory latency is the length of time between the memory's receipt of a read request and its release of the data corresponding to the request.
Cycle time: It is the minimum time between requests to memory.
SRAM vs. DRAM:
SRAM stores data for as long as power is supplied; DRAM stores data only for a few milliseconds irrespective of the power supply.
SRAM uses nearly 6 transistors for each memory cell; DRAM uses a single transistor and capacitor for each memory cell.
SRAM does not need its memory cells refreshed; DRAM needs refreshing circuitry.
SRAM cells are made of a larger number of components per cell; DRAM cells are made of fewer components per cell.
Primary Memory:
Primary memory is the main area in a computer in which data is stored for quick access by the computer's processor. It is divided into two parts:
i) Random Access Memory (RAM):
RAM is a type of computer primary memory in which any piece of data can be accessed at any time.
RAM stores data for as long as the computer is switched on or is in use. This type of memory is
volatile. The two types of RAM are:
Static RAM: This type of RAM is static in nature, as it does not have to be refreshed at regular intervals. Static RAM is made of a large number of flip-flops on an IC. It is costlier and has lower packing density.
Dynamic RAM: This type of RAM holds each bit of data in an individual capacitor in an
integrated circuit. It is dynamic in the sense that the capacitor charge is repeatedly
refreshed to ensure the data remains intact.
ii) Read Only Memory (ROM):
The ROM is nonvolatile memory. It retains stored data and information even if the power is turned off. In ROM, data is stored permanently and cannot be altered by the programmer. There are four types of ROM:
MROM (mask ROM): MROM (mask ROM) is manufacturer-Programmed ROM in which
data is burnt in by the manufacturer of the electronic equipment in which it is used and it is
not possible for a user to modify programs or data stored inside the ROM chip.
PROM (programmable ROM): PROM is one in which the user can load and store "read-only" programs and data. In PROM the programs or data are stored only once (the first time), and the stored data cannot be modified by the user.
EPROM (erasable programmable ROM): EPROM is one in which it is possible to erase the information stored in the chip, after which the chip can be reprogrammed to store new information. When an EPROM is in use, the information stored in it can only be read, and the information remains in the chip until it is erased.
EEPROM (electrically erasable and programmable ROM): EEPROM is a type of EPROM in which the stored information is erased using a high-voltage electric pulse. It is easier to alter information stored in an EEPROM chip.
Secondary Memory:
Secondary memory is where programs and data are kept on a long-term basis. It is a cheaper form of memory and slower than main or primary memory. It is non-volatile and cannot be accessed directly by the computer processor. It is the external memory of the computer system.
Example: hard disk drive, floppy disk, optical disk/ CD-ROM.
The basic memory element called cell can be in two states (0 or 1). The data can be written
into the cell and can be read from it.
During a Read or a Write operation, the row address is applied first. In response to a signal
pulse on the Row Address Strobe (RAS) input of the chip, this part of the address is loaded
into the row address latch.
All cells of this particular row are selected. Shortly after the row address is latched, the column
address is applied to the address pins.
It is loaded into the column address latch with the help of Column Address Strobe (CAS)
signal, similar to RAS.
The information in this latch is decoded and the appropriate Sense/Write circuit is selected.
The cache memory stores instructions and data that are frequently used, or data that is likely to be used next. The processor looks first in the cache memory for the data. If it finds the instructions or data there, it does not have to perform a more time-consuming read from the larger main memory or other data storage devices.
The processor does not need to know the exact location of the cache. It can simply issue
read and write instructions. The cache control circuitry determines whether the requested data
resides in the cache.
Cache and temporal reference: When data is requested by the processor, the data should
be loaded in the cache and should be retained till it is needed again.
Cache and spatial reference: Instead of fetching single data, a contiguous block of data is
loaded into the cache.
Terminologies in Cache
Split cache: It has separate data cache and a separate instruction cache. The two caches
work in parallel, one transferring data and the other transferring instructions.
A dual or unified cache: The data and the instructions are stored in the same cache. A
combined cache with a total size equal to the sum of the two split caches will usually have a
better hit rate.
Mapping Function: The correspondence between the main memory blocks and those in
the cache is specified by a mapping function.
Cache Replacement: When the cache is full and a memory word that is not in the cache is
referenced, the cache control hardware must decide which block should be removed to
create space for the new block that contains the referenced word. The collection of rules for
making this decision is the replacement algorithm.
Hit ratio = hit / (hit + miss) = Number of hits/ Total accesses to the cache
Miss penalty or cache penalty is the sum of the time to place a block in the cache and the time to deliver the block to the CPU.
Miss penalty = time for block replacement + time to deliver the block to CPU
Cache performance can be enhanced by using higher cache block size, higher associativity,
reducing miss rate, reducing miss penalty, and reducing the time to hit in the cache. CPU
execution Time of a given task is defined as the time spent by the system executing that task,
including the time spent executing run-time or system services.
CPU execution time = (CPU clock cycles + Memory stall cycles (if any)) x Clock cycle time
The memory stall cycles are a count of the memory cycles during which the CPU is waiting for memory accesses. This depends on the number of cache misses and the cost per miss (the miss penalty):
Memory stall cycles = Number of cache misses x Miss penalty
= Instruction count x (Misses/instruction) x Miss penalty
= Instruction count (IC) x (Memory accesses/instruction) x Miss rate x Miss penalty
= IC x Reads per instruction x Read miss rate x Read miss penalty + IC x Writes per instruction x Write miss rate x Write miss penalty
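A worked Python example of these formulas under illustrative assumptions (1 GHz clock, base CPI of 1, 1.3 memory accesses per instruction):

ic           = 1_000_000    # instructions executed
cpi_base     = 1.0          # CPI with a perfect cache
accesses_pi  = 1.3          # memory accesses per instruction (fetch + data)
miss_rate    = 0.05
miss_penalty = 100          # cycles per miss
cycle_time   = 1e-9         # seconds (1 GHz clock)

stall_cycles = ic * accesses_pi * miss_rate * miss_penalty
cpu_time = (ic * cpi_base + stall_cycles) * cycle_time
print(cpu_time)             # 0.0075 s, versus 0.001 s with a perfect cache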
Direct Mapping
The simplest technique is direct mapping that maps each block of main memory into only
one possible cache line.
Here, each memory block is assigned to a specific line in the cache.
If a line is previously taken up by a memory block and when a new block needs to be
loaded, then the old block is replaced.
Direct mapping's performance is directly proportional to the hit ratio.
The direct mapping rule: if the ith block of main memory has to be placed at the jth block of cache memory, then j = i % (number of blocks in cache memory).
Consider a 128 block cache memory. Whenever main memory blocks 0, 128 or 256 are loaded in the cache, they will all be allotted cache block 0, since j = (0 or 128 or 256) % 128 is zero.
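A Python sketch of the placement rule and the 16-bit address split used in this example (128 blocks, 16 words per block, 5-bit tag):

def cache_block(i, num_cache_blocks=128):
    return i % num_cache_blocks       # j = i % (number of blocks in cache)

def split_address(addr16):
    word  = addr16 & 0xF              # lower 4 bits: word within the block
    block = (addr16 >> 4) & 0x7F      # next 7 bits: cache block position
    tag   = (addr16 >> 11) & 0x1F     # upper 5 bits: tag
    return tag, block, word

for b in (0, 128, 256):
    assert cache_block(b) == 0        # blocks 0, 128 and 256 all contend for block 0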
Contention or collision is resolved by replacing the older contents with latest contents.
The placement of the block from main memory to the cache is determined from the 16 bit
memory address.
The lower order four bits are used to select one of the 16 words in the block.
The 7 bit block field indicates the cache position where the block has to be stored.
The 5 bit tag field represents which block of main memory resides inside the cache.
This method is easy to implement but is not flexible.
Drawback: every block of main memory is directly mapped to a single cache block. This results in a high rate of conflict misses: cache blocks may have to be replaced very frequently even while other blocks in the cache remain empty.
The simplest way to keep the main memory and the cache consistent is to always write the
data into both the memory and the cache. This scheme is called write-through.
Write through is a scheme in which writes always update both the cache and the memory,
ensuring that data is always consistent between the two.
With a write-through scheme, every write causes the data to be written to main memory.
These writes will take a long time.
A potential solution to this problem is deploying write buffer.
A write buffer stores the data while it is waiting to be written to memory.
After writing the data into the cache and into the write buffer, the processor can continue
execution.
When a write to main memory completes, the entry in the write buffer is freed.
If the write buffer is full when the processor reaches a write, the processor must stall until
there is an empty position in the write buffer.
If the rate at which the memory can complete writes is less than the rate at which the
processor is generating writes, no amount of buffering can help because writes are being
generated faster than the memory system can accept them.
Write buffer is a queue that holds data while the data are waiting to be
written to memory.
The rate at which writes are generated may also be less than the rate at which the memory can accept them, and yet stalls may still occur. To reduce the occurrence of such stalls, processors usually increase the depth of the write buffer beyond a single entry.
An alternative to a write-through scheme is a scheme called write-back. When a write occurs, the new value is written only to the block in the cache.
The modified block is written to the lower level of the hierarchy when it is replaced.
Write-back schemes can improve performance, especially when processors can generate writes as fast as or faster than the writes can be handled by main memory; a write-back scheme is, however, more complex to implement than write-through.
Write-back is a scheme that handles writes by updating values only to the block in the
cache, then writing the modified block to the lower level of the hierarchy when the block is
replaced.
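A hedged Python sketch contrasting the two policies on a single cache block; the Block class and function names are illustrative:

class Block:
    def __init__(self):
        self.data, self.dirty = 0, False

def write_through(block, memory, addr, value):
    block.data = value            # update the cache...
    memory[addr] = value          # ...and main memory, on every write

def write_back(block, addr, value):
    block.data = value            # update only the cache
    block.dirty = True            # remember to update memory on replacement

def replace(block, memory, addr):
    if block.dirty:               # write-back: memory is updated only here
        memory[addr] = block.data
        block.dirty = False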
The concept of virtual memory in computer organization is allocating memory from the
hard disk and making that part of the hard disk as a temporary RAM. In other words, it is a
technique that uses main memory as a cache for secondary storage. The motivations for
virtual memory are:
To allow efficient and safe sharing of memory among multiple programs
To remove the programming burdens of a small, limited amount of main memory.
Virtual memory provides an illusion to the users that the PC has enough primary memory to run their programs. Even when the size of the program to be executed is much bigger than the available primary memory, the user never feels that the system needs bigger primary storage to run that program. When the RAM is full, the operating system occupies a portion of the hard disk and uses it as RAM. The parts of the program that are not currently being executed are kept in that part of secondary storage, and the parts of the program that are executed are first brought into main memory. This is the theory behind virtual memory.
Terminologies:
Physical address is an address in main memory.
Protection is a set of mechanisms for ensuring that multiple processes sharing the processor, memory, or I/O devices cannot interfere with one another by reading or writing each other's data.
Virtual memory breaks programs into fixed-size blocks called pages.
Page fault is an event that occurs when an accessed page is not present in main memory.
Virtual address is an address that corresponds to a location in virtual space and is
translated by address mapping to a physical address when memory is accessed.
Address translation or address mapping is the process by which a virtual address is
mapped to an address used to access memory.
Working mechanism
In virtual memory, blocks of memory are mapped from one set of addresses (virtual
addresses) to another set (physical addresses).
The processor generates virtual addresses while the memory is accessed using physical
addresses.
Both the virtual memory and the physical memory are broken into pages, so that a virtual
page is really mapped to a physical page.
It is also possible for a virtual page to be absent from main memory and not be mapped to a
physical address, residing instead on disk.
Physical pages can be shared by having two virtual addresses point to the same physical
address. This capability is used to allow two different programs to share data or code.
Virtual memory also simplifies loading the program for execution by providing relocation.
Relocation maps the virtual addresses used by a program to different physical addresses
before the addresses are used to access memory. This relocation allows us to load the
program anywhere in main memory.
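As a concrete illustration of this mapping, the following sketch (with an assumed 4 KiB page size and an invented page table, purely for illustration) splits a virtual address into a virtual page number and an offset, and translates it to a physical address:

```python
# Sketch: virtual-to-physical address translation with 4 KiB pages (assumed).
PAGE_SIZE = 4096                       # the offset field is therefore 12 bits

page_table = {0: 7, 1: 3, 2: None}     # virtual page -> physical page (None = on disk)

def translate(virtual_address):
    vpn, offset = divmod(virtual_address, PAGE_SIZE)   # split the address
    ppn = page_table.get(vpn)
    if ppn is None:
        raise RuntimeError("page fault: virtual page %d is not in main memory" % vpn)
    return ppn * PAGE_SIZE + offset    # same offset, different page number

print(hex(translate(0x1234)))          # virtual page 1 -> physical page 3: 0x3234
```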
2. Last In First Out (LIFO) page replacement algorithm
It replaces the newest page that arrived at last in the main memory. It is implemented by
keeping track of all the pages in a stack.
3. Least Recently Used (LRU) page replacement algorithm
It replaces the page that has not been used for the longest time with the new page.
Example 4.6: Consider the following reference string: 7, 0, 1, 2, 0, 3, 4, 2, 3, 0, 3, 2, 1, 2, 0,
1, 7, 0, 1. Calculate the number of page faults when the page frame size is 3 using the LRU policy.
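The count for Example 4.6 can be checked with a short simulation (an illustrative helper, not from the text):

```python
# Sketch: count page faults under LRU replacement for a reference string.
def lru_page_faults(references, num_frames):
    frames = []                  # ordered from least to most recently used
    faults = 0
    for page in references:
        if page in frames:
            frames.remove(page)  # hit: the page becomes most recently used
        else:
            faults += 1          # miss: a page fault occurs
            if len(frames) == num_frames:
                frames.pop(0)    # evict the least recently used page
        frames.append(page)
    return faults

refs = [7, 0, 1, 2, 0, 3, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(lru_page_faults(refs, 3))  # 11 page faults for this string
```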
If the entry is not found in the TLB (TLB miss), then the CPU has to access the page table in
main memory and then access the actual frame in main memory. Therefore, in the case of a
TLB hit, the effective access time is lower as compared to the case of a TLB miss.
If the probability of a TLB hit is p (the TLB hit rate), then the probability of a TLB miss (the
TLB miss rate) is (1 − p). The effective access time can be defined as
Effective access time = p(t + m) + (1 − p)(t + k·m + m)
where p is the TLB hit rate, t is the time taken to access the TLB, m is the time taken to
access main memory, and k is the number of page-table levels (k = 1 for single-level paging).
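A quick numeric check of the formula (the timing values below are assumed purely for illustration):

```python
# Sketch: effective access time with a TLB, assuming k-level paging.
def effective_access_time(p, t, m, k=1):
    # TLB hit: one TLB lookup plus one memory access.
    # TLB miss: one TLB lookup, k page-table accesses, then the memory access.
    return p * (t + m) + (1 - p) * (t + k * m + m)

# Assumed values: 98% hit rate, 10 ns TLB access, 100 ns memory access.
print(effective_access_time(0.98, 10, 100))   # 112.0 ns
```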
Hardware Level:
Memory protection at hardware level is done in three methods:
The machine should support two modes: supervisor mode and user mode. This indicates
whether the currently running process is a user process or a supervisory process. A process
running in supervisor or kernel mode is an operating system process.
Include a user/supervisor bit in the TLB entries to indicate whether the process is in user or
supervisor mode. This access control mechanism allows a user process only to read the
TLB and not to write to it.
The processors can switch between user and supervisor mode. The switching from user to
system mode is done through system calls that transfers control to a dedicated location in
supervisor code space.
System call is a special instruction that transfers control from user mode to a
dedicated location in supervisor code space, invoking the exception mechanism
in the process.
4.4 PARALLEL BUS ARCHITECTURES
Single bus architectures connect multiple processors with their own cache memory using
shared bus. This is a simple architecture but it suffers from latency and bandwidth issues.
This naturally led to deploying parallel or multiple bus architectures. Multiple bus
multiprocessor systems use several parallel buses to interconnect multiple processors with
multiple memory modules. The following are the connection schemes in multi bus
architectures:
Multiple bus with single bus memory connection (MBSBMC) has each memory module
connected to a specific bus. For N processors with M memory modules and B buses, the
number of connections required is BN + M, and the load on a given bus is N plus the
number of memory modules attached to that bus.
Fig 4.16 b) Multiple bus with single bus memory connection (MBSBMC)
A bus can be classified as synchronous or asynchronous. The time for any transaction over a
synchronous bus is known in advance. Asynchronous bus depends on the availability of
data and the readiness of devices to initiate bus transactions.
The processors that want to use the bus submit their requests to bus arbitration logic. The
latter decides, using a certain priority scheme, which processor will be granted access to the
bus during a certain time interval (bus master).
The process of passing bus mastership from one processor to another is called
handshaking, which requires the use of two control signals: bus request and bus grant.
Bus request: indicates that a given processor is requesting mastership of the bus.
Bus grant: indicates that bus mastership is granted.
Bus busy: usually indicates whether or not the bus is currently being used.
In deciding which processor gains control of the bus, the bus arbitration logic uses a
predefined priority scheme.
Among the priority schemes used are random priority, simple rotating priority, equal
priority, and least recently used (LRU) priority.
After each arbitration cycle, in simple rotating priority, all priority levels are reduced one
place, with the lowest priority processor taking the highest priority. In equal priority, when
two or more requests are made, there is equal chance of any one request being processed.
In the LRU algorithm, the highest priority is given to the processor that has not used the bus
for the longest time.
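The simple rotating priority scheme can be sketched as follows (an illustrative software model, not a hardware description):

```python
# Sketch: simple rotating-priority bus arbitration.
class RotatingArbiter:
    def __init__(self, num_processors):
        self.order = list(range(num_processors))   # order[0] = highest priority

    def arbitrate(self, requests):
        if not requests:
            return None                            # no processor wants the bus
        # Grant the bus to the highest-priority requesting processor.
        granted = next(p for p in self.order if p in requests)
        # After the cycle, all priority levels drop one place and the
        # lowest-priority processor takes the highest priority.
        self.order = self.order[1:] + self.order[:1]
        return granted

arbiter = RotatingArbiter(4)
print(arbiter.arbitrate({1, 3}))   # priority order 0,1,2,3 -> processor 1 wins
print(arbiter.arbitrate({1, 3}))   # order is now 1,2,3,0 -> processor 1 again
```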
In the first step, the CPU issues a command to the disk controller telling it to read data
from the disk into its internal buffer. When valid data are in the disk controller's buffer,
DMA can begin: the DMA controller initiates the transfer by issuing a read request over the
bus to the disk controller.
This read request looks like any other read request, and the disk controller does not know
whether it came from the CPU or from a DMA controller.
The memory address to write to is on the bus address lines, so when the disk controller
fetches the next word from its internal buffer, it knows where to write it.
The write to memory is another standard bus cycle.
When the write is complete, the disk controller sends an acknowledgement signal to the
DMA controller, also over the bus.
The DMA controller then increments the memory address to use and decrements the byte
count. If the byte count is still greater than 0, steps 2 through 4 are repeated until the count
reaches 0.
At that time, the DMA controller interrupts the CPU to let it know that the transfer is now
complete.
When the operating system starts up, it does not have to copy the disk block to memory; it
is already there.
Serial Peripheral Interface (SPI) is an interface bus designed by Motorola to send data
between microcontrollers and small peripherals such as shift registers, sensors, and SD
cards. It uses separate clock and data lines, along with a select line to choose the device.
A standard SPI connection involves a master connected to slaves using the serial clock
(SCK), Master Out Slave In (MOSI), Master In Slave Out (MISO), and Slave Select
(SS) lines.
The SCK, MOSI, and MISO signals can be shared by slaves while each slave has a unique SS
line.
The SPI interface defines no protocol for data exchange, limiting overhead and allowing for
high speed data streaming.
Clock polarity (CPOL) and clock phase (CPHA) can be specified as '0' or '1' to form four
unique modes to provide flexibility in communication between master and slave.
If CPOL and CPHA are both '0' (defined as Mode 0), data is sampled at the leading rising edge
of the clock. Mode 0 is by far the most common mode for SPI bus slave communication.
If CPOL is '1' and CPHA is '0' (Mode 2), data is sampled at the leading falling edge of the clock.
Likewise, CPOL = '0' and CPHA = '1' (Mode 1) results in data sampled on the trailing falling edge,
and CPOL = '1' with CPHA = '1' (Mode 3) results in data sampled on the trailing
rising edge.
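A bit-banged Mode 0 transfer can be sketched as follows (a hypothetical software model; the pin-access functions are stubs standing in for real GPIO or controller hardware):

```python
# Sketch: bit-banged SPI master byte transfer in Mode 0 (CPOL = 0, CPHA = 0).
pins = {"SCK": 0, "MOSI": 0, "SS": 1}       # idle: clock low, slave deselected

def set_pin(name, value):
    pins[name] = value                      # stand-in for a real GPIO write

def read_miso():
    return 0                                # stand-in: pretend the slave sends 0s

def spi_transfer_mode0(byte_out):
    byte_in = 0
    set_pin("SS", 0)                        # select the slave
    for i in range(7, -1, -1):              # shift out MSB first
        set_pin("MOSI", (byte_out >> i) & 1)   # set the data bit while SCK is low
        set_pin("SCK", 1)                   # rising edge: both ends sample
        byte_in = (byte_in << 1) | read_miso()
        set_pin("SCK", 0)                   # falling edge: prepare the next bit
    set_pin("SS", 1)                        # deselect the slave
    return byte_in

print(spi_transfer_mode0(0xA5))             # full duplex: a byte out, a byte in
```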
In addition to the standard 4-wire configuration, the SPI interface has been extended to
include a variety of IO standards including 3-wire for reduced pin count and dual or quad
I/O for higher throughput.
In 3-wire mode, MOSI and MISO lines are combined to a single bidirectional data line.
Transactions are half-duplex to allow for bidirectional communication. Reducing the
number of data lines and operating in half-duplex mode also decreases maximum possible
throughput; many 3-wire devices have low performance requirements and are instead
designed with low pin count in mind.
An inter-integrated circuit (Inter-IC or I2C) is a multi-master serial bus that connects low-
speed peripherals to a motherboard, mobile phone, embedded system or other electronic
devices.
Philips Semiconductor created I2C to enable communication between chips residing on
the same Printed Circuit Board (PCB).
It is a multi-master, multi-slave protocol.
It is designed to lessen costs by streamlining massive wiring systems with an easier
interface for connecting a central processing unit (CPU) to peripheral chips in a television.
It had a battery-controlled interface but later utilized an internal bus system.
It is built on two lines:
SDA (Serial Data) – The line for the master and slave to send and receive data
SCL (Serial Clock) – The line that carries the clock signal.
Devices on an I2C bus are always either a master or a slave. The master is the device which
always initiates a communication and drives the clock line (SCL). Usually a microcontroller
or microprocessor acts as the master, which needs to read data from or write data to slave
peripherals.
Slave devices always respond to the master and won't initiate any communication by
themselves. Devices like EEPROMs, LCDs and RTCs act as slave devices. Each slave device
has a unique address so that the master can request data from it or write data to it.
The master device uses either a 7-bit or 10-bit address to specify the slave device as its
partner of data communication and it supports bi-directional data transfer.
Working of I2C
In I2C, data is transferred in messages, which are broken up into frames of data. Each
message has an address frame that contains the binary address of the slave, and one or
more data frames that contain the data being transmitted.
The message also includes start and stop conditions, read/write bits, and ACK/NACK bits
between each data frame.
An I2C message is made up of the following components:
1. Start Condition: The SDA line switches from a high voltage level to a low voltage level
before the SCL line switches from high to low.
2. Stop Condition: The SDA line switches from a low voltage level to a high voltage level after
the SCL line switches from low to high.
3. Address Frame: A 7 or 10 bit sequence unique to each slave that identifies the slave when
the master wants to talk to it.
4. Read/Write Bit: A single bit specifying whether the master is sending data to the slave
(low voltage level) or requesting data from it (high voltage level).
5. ACK/NACK Bit: Each frame in a message is followed by an acknowledge/no-acknowledge
bit. If an address frame or data frame was successfully received, an ACK bit is returned to
the sender from the receiving device.
I2C doesn't have slave select lines like SPI, so it needs another way to let the slave know that
data is being sent to it, and not to another slave. It does this by addressing. The address frame
is always the first frame after the start bit in a new message.
The master sends the address of the slave it wants to communicate with to every slave
connected to it. Each slave then compares the address sent from the master to its own
address.
If the address matches, it sends a low voltage ACK bit back to the master. If the address
doesn't match, the slave does nothing and the SDA line remains high.
Read/Write Bit
The address frame includes a single bit at the end that informs the slave whether the master
wants to write data to it or receive data from it. If the master wants to send data to the
slave, the read/write bit is a low voltage level. If the master is requesting data from the
slave, the bit is a high voltage level.
Data Frame
After the master detects the ACK bit from the slave, the first data frame is ready to be sent.
The data frame is always 8 bits long, and sent with the most significant bit first.
Each data frame is immediately followed by an ACK/NACK bit to verify that the frame has
been received successfully.
The ACK bit must be received by either the master or the slave (depending on who is
sending the data) before the next data frame can be sent.
After all of the data frames have been sent, the master can send a stop condition to the slave
to halt the transmission.
The stop condition is a voltage transition from low to high on the SDA line after a low to
high transition on the SCL line, with the SCL line remaining high.
The complete sequence of an I2C data transmission is:
1. The master sends the start condition to every connected slave by switching the SDA line
from high to low before switching the SCL line from high to low.
2. The master sends each slave the 7 or 10 bit address of the slave it wants to communicate
with, along with the read/write bit.
3. Each slave compares the address sent from the master to its own address. If the address
matches, the slave returns an ACK bit by pulling the SDA line low for one bit. If the address
from the master does not match the slave's own address, the slave leaves the SDA line high.
4. The master sends or receives the data frame.
5. After each data frame has been transferred, the receiving device returns another ACK bit to
the sender to acknowledge successful receipt of the frame.
6. To stop the data transmission, the master sends a stop condition to the slave by switching
SCL high before switching SDA high.
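The sequence above can be tied together in a small software model of a master write transaction (the bus and slave objects are invented for illustration and do not correspond to any real driver API):

```python
# Sketch: the frame sequence of an I2C master write, modeled in software.
ACK, NACK = 0, 1

class FakeSlave:
    """Illustrative slave that ACKs its own 7-bit address and all data."""
    def __init__(self, address):
        self.address = address
        self.received = []

    def on_address(self, address, write):
        return ACK if address == self.address else NACK

    def on_data(self, byte):
        self.received.append(byte)
        return ACK

def i2c_master_write(slaves, address, data):
    # Start condition (SDA high->low while SCL is high) is implicit here.
    # Address frame: 7-bit address plus the read/write bit (0 = write).
    responders = [s for s in slaves if s.on_address(address, write=True) == ACK]
    if not responders:
        return False                 # no slave pulled SDA low: NACK
    slave = responders[0]
    # Data frames: 8 bits each, MSB first, each followed by an ACK/NACK bit.
    for byte in data:
        if slave.on_data(byte) != ACK:
            return False
    # Stop condition (SDA low->high while SCL is high) is implicit here.
    return True

eeprom = FakeSlave(address=0x50)
print(i2c_master_write([eeprom], 0x50, [0x12, 0x34]))   # True
print(eeprom.received)                                  # [18, 52]
```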
Advantages
It uses two wires.
This supports multiple masters and multiple slaves.
ACK/NACK bit gives confirmation that each frame is transferred successfully.
Well known and widely used protocol
Disadvantages
Slower data transfer rate than SPI.
The size of the data frame is limited to 8 bits
More complicated hardware needed to implement than SPI.
4.7 MASS STORAGE
Mass storage refers to various techniques and devices for storing large amounts of data.
Mass storage is distinct from memory, which refers to temporary storage areas within the
computer. Unlike main memory, mass storage devices retain data even when the computer
is turned off.
Examples of mass storage devices include:
solid state devices
hard drives
optical drives
tape drives
RAID storage
USB storage
flash memory cards
Solid State Devices
Solid-state devices are electronic devices in which electricity flows through solid
semiconductor crystals like silicon, gallium arsenide, and germanium rather than through
vacuum tubes.
They do not involve any moving parts or magnetic materials.
RAM is a solid state device that consists of microchips that store data on non-moving
components, providing for fast retrieval of that data.
Transistors are the most important solid state devices. A transistor contains two p–n
junctions and has three contacts or terminals.
Because transistors rely on the action of perpendicular electric fields, their behavior is
more difficult to understand than that of diodes.
There are two main types of transistors: the bipolar junction transistor (BJT), in which a
current is amplified, and the field effect transistor (FET), in which a voltage controls a current.
In a solid-state component, the current is confined to solid elements and compounds
engineered specifically to switch and amplify it.
Current flows in two forms: as negatively charged electrons, and as positively charged
electron deficiencies called holes.
In some semiconductors, the current consists mostly of electrons; in other semiconductors,
it consists mostly of holes. Both the electron and the hole are called charge carriers.
Hard Drives
A hard disk drive is a non-volatile memory hardware device that permanently stores and
retrieves data on a computer.
A hard drive is a secondary storage device that consists of one or more platters to which
data is written using a magnetic head, all inside of an air-sealed casing.
Internal hard disks reside in a drive bay, connect to the motherboard using an ATA, SCSI, or
SATA cable, and are powered by a connection to the power supply unit.
Optical Drives
An optical drive is a computer component that reads and writes optical discs such as CDs,
DVDs and Blu-ray discs.
The drive contains some lenses that project electromagnetic waves that are responsible for
reading and writing data on optical discs.
An optical disk drive uses a laser to read and write data. A laser in this context means an
electromagnetic wave with a very specific wavelength within or near the visible light
spectrum.
An optical drive that works with all types of discs will have two separate lenses: one for
CD/DVD and one for Blu-ray.
An optical drive has a rotational mechanism to spin the disc. Optical drives were originally
designed to work at a constant linear velocity (CLV) (i.e.) the disc spins at varying speeds
depending on where the laser beam is reading, so the spiral groove of the disc passes by the
laser at a constant speed.
An optical drive also needs a loading mechanism: either a tray-loading mechanism, where
the disc is placed onto a motorized tray which moves in and out of the computer case, or a
slot-loading mechanism, where the disc is slid into a slot and motorized rollers are used to
move the disc in and out.
Tape Drives
A tape drive is a device that stores computer data on magnetic tape, especially for backup
and archiving purposes.
Tape drives work either by using a traditional helical scan where the recording and
playback heads touch the tape, or linear tape technology, where the heads never actually
touch the tape.
Drives can be rewinding, where the device issues a rewind command at the end of a
session, or non-rewinding.
Rewinding devices are most commonly used when a tape is to be unmounted at the end of a
session after batch processing of large amounts of data.
Non-rewinding devices are useful for incremental backups and other applications where
new files are added to the end of the previous session's files.
The different types of tapes are audio, video and data storage tape.
Redundant Array of Inexpensive Disks (RAID) Storage
RAID is a way of storing the same data in different places on multiple hard disks to protect
data in the case of a drive failure.
RAID works by placing data on multiple disks and allowing input/output (I/O) operations
to overlap in a balanced way, improving performance. Because the use of multiple disks
lowers the mean time between failures (MTBF) of the array, storing data redundantly also
increases fault tolerance.
A RAID controller can be used as a level of abstraction between the OS and the physical
disks, presenting groups of disks as logical units. Using a RAID controller can improve
performance and help protect data in case of a crash.
Levels in RAID:
1. RAID 0 (Disk striping):
RAID 0 splits data across any number of disks allowing higher data throughput. An
individual file is read from multiple disks giving it access to the speed and capacity of all of
them. This RAID level is often referred to as striping and has the benefit of increased
performance.
2. RAID 1 (Disk Mirroring):
RAID 1 writes and reads identical data to pairs of drives. This process is often called data
mirroring, and its primary function is to provide redundancy. If any of the disks in the array
fails, the system can still access data from the remaining disk(s).
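Returning to RAID 0, the way striping spreads consecutive blocks across the array can be sketched as follows (the disk count and block-level striping are assumed for illustration):

```python
# Sketch: RAID 0 striping -- map a logical block number to (disk, stripe).
NUM_DISKS = 4                              # assumed width of the array

def raid0_location(logical_block):
    disk = logical_block % NUM_DISKS       # consecutive blocks rotate across disks
    stripe = logical_block // NUM_DISKS    # row of blocks at the same disk offset
    return disk, stripe

for block in range(8):
    disk, stripe = raid0_location(block)
    print(f"logical block {block} -> disk {disk}, stripe {stripe}")
```

Because consecutive blocks live on different disks, a large read can be served by all the disks in parallel, which is where the performance benefit comes from; note that RAID 0 itself provides no redundancy.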
Flash Drives
A flash drive stores data using flash memory. Flash memory uses an electrically erasable
programmable read-only memory (EEPROM) format to store and retrieve data.
Flash drives are non-volatile, which means they do not need a battery backup.
Most computers come equipped with USB ports, which detect inserted flash drives and
install the necessary drivers to make the data retrievable.
Computer users can store and retrieve data once the operating system has detected a
connection to the USB port.
Flash drives have a USB mass storage device classification, which means they do not require
additional drivers.
The computer's operating system recognizes a block-structured logical unit, which means it
can use any file system or block addressing system to read the information on the flash
drive.
A flash drive enters emulation mode, or acts as a hard drive, once it has connected to the USB
port. This makes it easier to transfer data between the flash drive and the computer.
Flash memory is known as a solid state storage device, meaning there are no moving parts
— everything is electronic instead of mechanical.
Keyboards
When a key is pressed, the mechanical action of the switch causes some vibration, called
bounce, which the processor filters out.
If the key is pressed and held continuously, the processor recognizes it as the equivalent of
pressing a key repeatedly.
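Both behaviours can be sketched in software (the timing constants and the sample format are assumed for illustration):

```python
# Sketch: key debouncing plus auto-repeat (typematic) behaviour.
DEBOUNCE_TIME = 0.01   # ignore level changes within 10 ms of the last edge (assumed)
REPEAT_DELAY  = 0.5    # a key held this long is reported again (assumed)

def process_samples(samples):
    """samples: list of (time_in_seconds, pressed) pairs from the raw key line."""
    events, state, last_edge, press_time = [], False, -1.0, None
    for t, pressed in samples:
        if pressed != state and (t - last_edge) >= DEBOUNCE_TIME:
            state, last_edge = pressed, t          # a genuine press or release
            if pressed:
                events.append((t, "press"))
                press_time = t
        elif state and press_time is not None and (t - press_time) >= REPEAT_DELAY:
            events.append((t, "repeat"))           # key held: report it again
            press_time = t
    return events

# The bounce at 0.002-0.004 s collapses into one press; holding produces repeats.
raw = [(0.000, True), (0.002, False), (0.004, True), (0.600, True), (1.200, True)]
print(process_samples(raw))    # [(0.0, 'press'), (0.6, 'repeat'), (1.2, 'repeat')]
```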
Another type of keyboard has three layers: top plasticized layer with key positions marked
on the top surface and conducting traces on another side; middle layer made of rubber with
hole for key positions; bottom metallic layer with raised bumps for key positions.
When a key is pressed the trace underneath the top layer comes in contact with the bump in
the last layer, thus completing an electrical circuit. The current flow is sensed by the
microcontroller.
Scanners
Scanners operate by shining light at the object or document being digitized and directing
the reflected light onto a photosensitive element.
In most scanners, the sensing medium is an electronic, light-sensing integrated circuit
known as a charge-coupled device (CCD).
Light-sensitive photo sites arrayed along the CCD convert levels of brightness into
electronic signals that are then processed into a digital image.
A scanner consists of a flat transparent glass bed under which the CCD sensors, lamp,
lenses, filters and also mirrors are fixed.
The document has to be placed on the glass bed. There will also be a cover to close the
scanner.
The lamp brightens up the text to be scanned. Most scanners use a cold cathode fluorescent
lamp (CCFL).
A stepper motor under the scanner moves the scanner head from one end to the other. The
movement will be slow and is controlled by a belt.
The scanner head consists of the mirrors, lens, CCD sensors and also the filter. The scan
head moves parallel to the glass bed along a fixed path.
As the scan head moves under the glass bed, the light from the lamp hits the document and
is reflected back with the help of mirrors angled to one another.
According to the design of the device there may be either 2-way mirrors or 3-way mirrors.
The mirrors will be angled in such a way that the reflected image will be hitting a smaller
surface.
In the end, the image will reach a lens which passes it through a filter and causes the image
to be focused on CCD sensors.
The CCD sensors convert the light to electrical signals according to its intensity.
The electrical signals will be converted into image format inside a computer.
The beam is again blanked, and moved back to the top left to start again.
This process draws a complete picture, typically 50 to 100 times a second.
The number of times in one second that the electron gun redraws the entire image is called
the refresh rate and is measured in hertz (cycles per second).
It is common, particularly in lower priced equipment, for all the odd-numbered lines of an
image to be traced, and then all the even-numbered lines; the circuitry of such an interlaced
display needs to have only half the speed of a non-interlaced display.
An interlaced display, particularly at a relatively low refresh rate, can appear to some
observers to flicker, and may cause eye strain and nausea.
The intensity or strength of the electron beam is controlled by setting the voltage levels.
The number of electrons that hits the screen determines the light emitted by the screen.
When the voltage is varied in the electron gun, the brightness of the display also varies.
The focusing hardware focuses the beam at all positions on the screen.
The deflection of electron beam is controlled by electric or magnetic fields.
Two pairs of coils are mounted on the CRT to produce the necessary deflection.
The coils are placed in such a way that the magnetic field produced by them results in a
transverse deflection force that is perpendicular to both the magnetic field and the electron beam.
Inkjet printers
Inkjet printers are the most popular printers for homes and small offices, as they have a
reasonable cost and good print quality.
Laser Printers
Laser printers are mainly used for large-scale, high-quality printing.
They are among the fastest printers available in the market.
A laser printer uses a slightly different approach for printing. It does not use ink like inkjet
printers; instead it uses a very fine powder known as toner.
The control circuitry is the part of the printer that talks with the computer and receives the
printing data.
A Raster Image Processor (RIP) converts the text and images in to a virtual matrix of dots.
The photo conducting drum which is the key component of the laser printer has a special
coating which receives the positive and negative charge from a charging roller.
A rapidly switching laser beam scans the charged drum line by line. When the beam flashes
on, it reverses the charge of tiny spots on the drum, corresponding to the dots that are to be
printed black.
UNIT - V
ADVANCED COMPUTER ARCHITECTURE
Most CPU designs are based on the von Neumann architecture and follow SISD.
The SISD model is a non-pipelined architecture with general-purpose registers, a
Program Counter (PC), an Instruction Register (IR), Memory Address Registers
(MAR) and Memory Data Registers (MDR).
Single Instruction, Multiple Data (SIMD) is an instruction set architecture that has a
single control unit (CU) and more than one processing unit (PU); it operates like a
von Neumann machine by executing a single instruction stream over the PUs, handled
through the CU.
The CU generates the control signals for all of the PUs, which thereby execute the same
operation on different data streams.
The SIMD architecture is capable of achieving data level parallelism.
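Data-level parallelism is easiest to see in code: one operation is applied across many data elements at once. The sketch below uses NumPy's vector operations as a software stand-in for SIMD execution (an illustration of the model, not of any particular machine):

```python
# Sketch: one operation applied to many data elements (SIMD style).
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# SISD style: one add per loop iteration, a single data stream.
c_scalar = [a[i] + b[i] for i in range(len(a))]

# SIMD style: a single vector add over all elements at once; NumPy
# dispatches this to the CPU's vector units where they are available.
c_simd = a + b

print(c_scalar)   # [11.0, 22.0, 33.0, 44.0]
print(c_simd)     # [11. 22. 33. 44.]
```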
Multiple Instruction, Single Data (MISD) is an Instruction Set Architecture for parallel
computing where many functional units perform different operations by executing
different instructions on the same data set.
This type of architecture is common mainly in the fault-tolerant computers executing
the same instructions redundantly in order to detect and mask errors.
4. HARDWARE MULTITHREADING
Multithreading enables the processing of multiple threads at one time, rather than
multiple processes. Since threads are smaller, more basic units of work than processes,
multithreading may occur within processes. A thread is an instruction stream with state
(registers and memory); the register state is also called the thread context. Threads can be
part of the same process or come from different programs. Threads in the same program
share the same address space and hence consume fewer resources.
The terms multithreading, multiprocessing and multitasking are used
interchangeably. But each has its unique meaning:
Multitasking: It is the process of executing multiple tasks simultaneously. In
multitasking, when a new thread needs to be executed, the old thread's context in hardware
is written back to memory and the new thread's context is loaded.
Multiprocessing: It is using two or more CPUs within a single computer system.
Multithreading: It is executing several parts of a program in parallel by dividing the
specific operations within a single application into individual threads.
Granularity: Threads are categorized based on the amount of work done by the thread;
this is known as granularity. How frequently the hardware switches among the loaded
thread contexts determines the granularity of multithreading.
How does the hardware scheduler interact with the software scheduler for
fairness?
What is the switching overhead vs. benefit?
Where do we store the contexts?
A trade-off must be made between fairness and system throughput: switch not only on a
miss, but also on data return.
This has a severe problem, because switching has a performance overhead: it requires
flushing of the pipeline and instruction window, reduces locality and increases resource contention.
One possible solution is to estimate the slowdown of each thread compared to when
run alone. Then enforce switching when slowdowns become significantly unbalanced.
Advantages:
Simpler to implement; can eliminate dependency checking and branch prediction
logic completely.
Switching need not have any performance overhead.
Disadvantages:
Higher performance overhead with deep pipelines and large instruction windows.
Low single thread performance: each thread gets 1/Nth of the bandwidth of the
pipeline.
Unused instruction slots, which arise from latencies during the pipelined execution of
single-threaded programs by a microprocessor, are filled by instructions of other
threads within a multithreaded processor.
The execution units are multiplexed among those thread contexts that are loaded in
the register sets.
Underutilization of a superscalar processor due to missing instruction-level
parallelism can be overcome by simultaneous multithreading, where a processor can
issue multiple instructions from multiple threads in each cycle.
Simultaneous multithreaded processors combine the multithreading technique with a
wide-issue superscalar processor to utilize a larger part of the issue bandwidth by
issuing instructions from different threads simultaneously.
Load Balancing:
A distinct feature in multiprocessor systems is load balancing.
Shared memory can quickly become a bottleneck for system performances, since all
processors must synchronize on the single bus and memory access.
Obviously, programs using remote data will run much slower than they would if the
data were stored in local memory. In NC-NUMA systems there is no cache coherency
problem, because there is no caching at all: each memory item is in a single location.
Remote memory access is, however, very inefficient. For this reason, NC-NUMA
systems can resort to special software that relocates memory pages from one
block to another, just to maximize performance.
Caching can alleviate the problem due to remote data access, but brings the cache
coherency issue.
A method to enforce coherency is obviously bus snooping, but this technique gets
too expensive beyond a certain number of CPUs, and it is much too difficult to
implement in systems that do not rely on bus-based interconnections.
The common approach in CC-NUMA systems with many CPUs to enforce cache
coherency is the directory-based protocol.
The basic idea is to associate each node in the system with a directory for its RAM
blocks: a database stating in which cache each block is located, and what its state is.
When a block of memory is addressed, the directory in the node where the block
is located is queried to find out whether the block is in any cache and, if so, whether
it has been changed with respect to the copy in RAM.
The GPU is designed to lessen the work of the CPU and produce faster video and graphics.
A GPU can be thought of as an extension of the CPU with thousands of cores. GPUs are
extensively used in PCs on a video card or motherboard, in mobile phones, display adapters,
workstations and game consoles. They are mainly used for offloading computation-intensive
applications. A GPU is also known as a visual processing unit (VPU).
Differences between CPU and GPU
GPU                                                CPU
Facilitates highly parallel operations.            Supports serial execution of programs.
Has a larger number of cores (in the thousands).   Has a smaller number of cores.
Needs special, faster interfaces to                No such special interfaces are required.
facilitate faster data transfers.
Has deeper pipelines.                              Has comparatively shallow pipelines.
A discrete GPU sits on a video card that is separate from the motherboard; it is connected
to the CPU and memory through the Accelerated Graphics Port (AGP) or the PCI
Express bus.
Sometimes, GPUs are integrated into the north bridge on the motherboard and use the
main memory as a digital storage area, but these GPUs are slower and have poorer
performance.
The accelerated memory in the GPU is used for mapping vertices and can also support
programmable shaders implementing textures, mathematical vertices and accurate
color formats.
Applications such as Computer-Aided Design (CAD) can process over 200 billion
operations per second and deliver up to 17 million polygons per second.
The main configurations of GPU processor are: Graphics coprocessor which is
independent of CPU and Graphics accelerator that is based on commands from CPU.
Vertex Processing
This stage processes vertices, performing operations like transformation, skinning and
lighting.
A vertex shader takes a single input vertex and produces a single output vertex.
Pixel Processing
Each pixel provided by triangle setup is fed into pixel processing as a set of attributes
which are used to compute the final color for this pixel.
The computations taking place here include texture mapping and math operations.
Energy efficiency: Since large numbers of systems are clustered, a lot of money is
invested in power distribution and heat dissipation. Work done per joule is critical
for both WSCs and servers because of the high cost of building the power and
mechanical infrastructure for a warehouse of computers and for the monthly utility
bills to power servers. If servers are not energy-efficient they will increase the:
cost of electricity
cost of infrastructure to provide electricity
cost of infrastructure to cool the servers.
Dependability via redundancy: The hardware and software in a WSC must
collectively provide at least 99.99% availability, while individual servers are much
less reliable. Redundancy is the key to dependability for both WSCs and servers. WSC
architects rely on multiple cost-effective servers connected by a low cost network and
redundancy managed by software. Multiple WSCs may be needed to handle faults in
whole WSCs. Multiple WSCs also reduce latency for services that are widely deployed.
Network I/O: Networking is needed to interface to the public as well as to keep data
consistent between multiple WSCs.
Interactive and batch-processing workloads: Search and social networks are
interactive and require fast response times. At the same time, indexing, big data
analytics etc. create a lot of batch processing workloads also. The WSC workloads
must be designed to tolerate large numbers of component faults without affecting the
overall performance and availability.
Differences between WSCs and data centers
Data Centers                                     WSCs
Data centers host services for multiple          WSCs are run by only one client.
providers.
There is little commonality between              Homogeneous hardware and software
hardware and software.                           management.
Third party software solutions.                  In-house middleware.
The economies of scale lead to cloud computing, since the lower per-unit costs
of WSCs lead to lower rental rates.
Even if a server had a Mean Time To Failure (MTTF) of twenty five years, the
WSC architect should design for about five server failures per day: with 50,000 such
servers, the expected failure rate is 50,000 / (25 × 365) ≈ 5.5 servers per day.
Google uses a relaxed consistency model in that all three replicas have to eventually
match, but not all at the same time.
5.6.4 Performance
Power Utilization Effectiveness (PUE) is a widely used metric to estimate the performance
of WSCs. PUE is the ratio of the total facility power to the power consumed by the IT
equipment alone, so its value is always at least 1; the closer it is to 1, the more efficient the WSC.
Dimension and size of network: It should be decided how many processing elements
there are in the network and what the dimensionality of the network is, i.e., with how
many neighbors each processor is connected.
A greater number of nodes means a greater network cost. It is good practice to
measure the hardware cost and the performance of the multiprocessor network together;
this gives more insight for designing a cost-effective parallel system.
Extensibility
It is the property that facilitates building large systems out of small ones with minimum
changes in the configuration of the nodes. It is measured by the smallest increment by
which the system can be expanded in a useful way. A network with a large number of links
or a large node degree tends to increase the hardware cost. Expandability is an important
parameter in evaluating the performance of a multiprocessor system. The feasibility of
extending a system while retaining its topological characteristics enables the design of
large-scale parallel systems.
The cube based architectures are widely used networks in parallel systems. They
have good topological properties such as symmetry, scalability and possess a rich
interconnection topology. The types of cube based networks are:
Binary hypercube or n-cube:
This is a loosely coupled parallel multiprocessor based on the binary n-cube
network.
An n-dimensional hypercube contains 2^n nodes and has n edges per node.
In hypercube, the number of communication links for each node is a
logarithmic function of the total number of nodes.
The hypercube organization has low diameter and high bisection width at the
expense of the number of edges per node and the length of the longest edge.
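These properties follow directly from the structure of the n-cube and can be tabulated with a short sketch (standard hypercube formulas; the code itself is illustrative):

```python
# Sketch: basic topological properties of an n-dimensional hypercube.
def hypercube_properties(n):
    nodes = 2 ** n                   # an n-cube has 2^n nodes
    degree = n                       # one neighbor per dimension
    diameter = n                     # worst case: flip all n address bits
    bisection_width = 2 ** (n - 1)   # links cut when halving the network
    return nodes, degree, diameter, bisection_width

for n in range(1, 5):
    print(n, hypercube_properties(n))

# Two nodes are neighbors iff their binary labels differ in exactly one bit.
assert bin(0b010 ^ 0b011).count("1") == 1
```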
The advantage of the cube-connected cycles is that the node degree is always 3,
independent of the value of n. This architecture is derived from the hypercube:
for example, a 3-cube is modified to form the 3-cube-connected cycles (CCC),
restricting the node degree to 3.
The idea is to replace the corner nodes (vertices) of the 3-cube with a ring of
3 nodes.
In general, one can construct k-cube-connected cycles from a k-cube by replacing
each of its n = 2^k vertices with a ring of k nodes.
Crossed Cube
The Crossed Cube (CC) has the same node and link complexity as the hypercube and
has most of its desirable properties, including regularity, recursive structure,
partitionability, strong connectivity and the ability to simulate other architectures.
Its diameter is only half of the diameter of the hypercube.
The mean distance between vertices is smaller, and it can simulate a hypercube through
dilation 2 embedding.
Optimal routing and broadcasting algorithms have been developed for the CC.
The CC is derived from a hypercube by changing the way of connection of some
hypercube links.
The diameter of CC is almost half of that of its corresponding hypercube.
Ring (R)
This is a simple linear array where the end nodes are connected. It is equivalent to
a mesh with wrap-around connections.
Data transfer in a ring is normally in one direction. A ring is obtained by connecting
the two terminal nodes of a linear array with one extra link.
A ring network can be unidirectional or bidirectional, and it is symmetric with a
constant node degree.
It has a constant node degree of d = 2; the diameter is N/2 for a bidirectional ring and N
for a unidirectional ring.
A ring network has a constant bisection width of 2.
The LEC (Linearly Extensible Cube) network grows linearly and possesses desirable
topological properties such as small diameter, high connectivity and a constant node degree,
with high scalability.
It has a constant expansion of only two processors at each level of extension while
preserving all the desirable topological properties.
The LEC network can maintain a constant node degree regardless of the increase in the size
of the network.
The number of nodes in an LEC network is 2·n for n > 0, whereas the number of nodes in the
hypercube is 2^n. The diameter of the network is N. It has a constant node degree of 4.