
UNIT IV ADVANCED MICROPROCESSOR

Block diagram of Advanced Microprocessor, Memory Hierarchy, Cache memory, Virtual memory, Paging & Segmentation, Pipelining – Pipeline hazards. Features and comparison of 80286, 80386, 80486, Pentium IV. Concept of core processor. Introduction to PowerPC, features of the PowerPC 601 and AMD Athlon processors. Features and applications of the SuperSPARC processor.

BLOCK DIAGRAM OF ADVANCED MICROPROCESSOR - 80286


The 80286 is the first member of the family of advanced microprocessors with memory management and protection abilities. With its 24-bit address bus, the 80286 CPU is able to address 16 MB of physical memory. Various versions of the 80286 are available that run at 12.5 MHz, 10 MHz and 8 MHz clock frequencies. The 80286 is upward compatible with the 8086 in terms of instruction set.

80286 has two operating modes namely:


 real address mode
 protected virtual address mode

Real address mode:


 The 80286 can address up to 1 MB of physical memory, like the 8086.
 The performance of the 80286 is up to six times that of the standard 8086.
 All memory management and protection mechanisms are disabled.
 The 80286 is object-code compatible with the 8086.

Protected Virtual address mode:


 The 80286 can address up to 16 MB of physical memory address space and 1 GB of virtual
memory address space.
 The 80286 works with all of its memory management and protection capabilities and
an advanced instruction set.
 The 80286 is source-code compatible with the 8086.

INTERNAL ARCHITECTURE OF 80286


The CPU contains four functional blocks:
1. Address Unit (AU)

2. Bus Unit (BU)

3. Instruction Unit (IU)


4. Execution Unit (EU)
1. Address Unit

The address unit is responsible for calculating the physical address of instructions and data that the CPU wants to access. The address lines derived by this unit may also be used to address different peripherals.

2. Bus Unit (BU)


The physical address computed by the address unit is handed over to the bus unit (BU) of the
CPU. Major function of the bus unit is to fetch instruction bytes from the memory.
Instructions are fetched in advance and stored in a queue to enable faster execution of the
instructions. The bus unit also contains a bus control module that controls the prefetcher
module. These prefetched instructions are arranged in a 6-byte instruction queue. The 6-byte prefetch queue forwards the instructions arranged in it to the instruction unit (IU).

3. Instruction Unit (IU)


The instruction unit accepts instructions from the prefetch queue and an instruction decoder
decodes them one by one. The decoded instructions are latched onto a decoded instruction
queue.

4.Execution Unit (EU)


The output of the decoding circuit drives a control circuit in the execution unit, which is responsible for executing the instructions received from the decoded instruction queue and for sending the data part of the instruction over the data bus.
The EU contains the register bank used for storing the data as scratch pad, or used as special
purpose registers. The ALU, the heart of the EU, carries out all the arithmetic and logical
operations and sends the results over the data bus or back to the register bank.

Register Organization of 80286

The 80286 CPU contains almost the same set of registers as the 8086, namely:

1. Eight 16-bit general purpose registers


2. Four 16-bit segment registers
3. Status and control registers
4. Instruction Pointer

General-Purpose Registers: Eight 16-bit general-purpose registers are used to store


arithmetic and logical operands. Four of these (AX, BX, CX, and DX) can be used either as
16-bit words or split into pairs of separate 8-bit registers.

Segment Registers: Four 16-bit special-purpose registers are used to select the segments of
memory that are immediately addressable for code, stack, and data.
Base and Index Registers: Four of the general-purpose registers can also be used to
determine offset addresses of operands in memory. Usually, these registers hold base
addresses or indexes to particular locations within a segment. Any specified addressing mode
determines the specific registers used for operand address calculations.
Status and Control Registers: Three 16-bit special-purpose registers are used to record and control the state of the 80286 processor. The instruction pointer contains the offset address of the next sequential instruction to be executed.

Flags Word Description

The flags D0, D2, D4, D6, D7 and D11, which are modified according to the result of the execution of logical and arithmetic instructions, are called status flag bits.
The bits D8 (Trap Flag) and D9 (Interrupt Flag) are used for controlling machine operation and are thus called control flags.

CF Carry Flag (bit D0) Set on high-order bit carry or borrow; cleared otherwise.

PF Parity Flag (bit D2) Set if the low-order 8 bits of the result contain an even number of 1 bits; cleared otherwise.
AF Auxiliary Carry Flag (bit D4) Set on carry from or borrow to the lower order four bits of
AL; cleared otherwise.

ZF Zero Flag (bit D6) Set if result is zero; cleared otherwise.

SF Sign Flag (bit D7) Set equal to high-order bit of result (0 if positive, 1 if negative).

TF Trap Flag (bit D8) Once set, a single-step interrupt occurs after the next instruction
executes. TF is cleared by the single step interrupt.

IF Interrupt–enable Flag (bit D9) When set, maskable interrupts will cause the CPU to transfer control to the location specified by the interrupt vector.

DF Direction Flag (bit D10) Causes string instructions to auto-decrement the appropriate
index registers when set. Clearing DF causes auto increment.

OF Overflow Flag (bit D11) Set if result is a too-large positive number or a too-small
negative number (excluding sign-bit) to fit in destination operand: cleared otherwise.

The additional fields available in 80286 flag registers are:

1. IOPL - I/O Privilege Field (bits D12 and D13)

2. NT - Nested Task flag (bit D14)

3. PE - Protection Enable (bit D16)

4. MP - Monitor Processor Extension (bit D17)

5. EM - Processor Extension Emulator (bit D18)

6. TS – Task Switch (bit D19)

IOPL - Input Output Privilege Level Flags


It is used in protected mode operation to select the privilege level for I/O operations. If the current privilege level is more trusted than or as trusted as the IOPL, I/O instructions execute without hindrance. If the current privilege level is less trusted than the IOPL, an exception occurs, causing execution to suspend.

NT - Nested Task Flag


When this flag is set, it indicates that one system task has invoked another through a CALL
instruction as opposed to a JMP.

PE - Protection Enable flag


It places the 80286 in protected mode, if set. This can only be cleared by resetting the CPU.
MP - Monitor Processor Extension
If this flag is set, it allows the WAIT instruction to generate a processor extension not present exception (used together with the TS flag).

EM - Emulate Processor Extension Flag


If set, causes a processor extension absent exception and permits the emulation of the processor extension by the CPU.

TS - Task Switch flag


If set, it indicates that the next instruction using a processor extension will generate exception 7, permitting the CPU to test whether the current processor extension context belongs to the current task.

MACHINE STATUS WORD (MSW)

 The machine status word consists of four flags – PE, MP, EM and TS – occupying the four lower-order bits (D16 to D19) of the upper word extending the flag register.

 The LMSW and SMSW instructions are available in the instruction set of 80286 to write
and read the MSW in real address mode.

REAL ADDRESS MODE


• In real address mode, the 80286 just acts as a fast 8086.

• The instruction set is upward compatible with that of the 8086.

• It addresses only 1 MB of physical memory, using A0-A19.

The lines A20-A23 are not used by the internal circuit of the 80286 in this mode. In real address mode, while addressing the physical memory, the 80286 uses BHE along with A0-A19. The 20-bit physical address is formed in the same way as in the 8086.
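As a quick illustration (our own sketch, not part of the original notes), the 8086-style address formation can be written in a few lines of Python:

def real_mode_address(segment, offset):
    """8086-style address formation: segment base shifted left
    by 4 bits, plus the 16-bit offset, truncated to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

# Example: segment 2000H, offset 0010H -> physical address 20010H
print(hex(real_mode_address(0x2000, 0x0010)))   # 0x20010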

The contents of segment registers are used as segment base addresses. The other registers,
depending upon the addressing mode, contain the offset addresses. Because of extra
pipelining and other circuit level improvements, in real address mode also, the 80286
operates at a much faster rate than 8086, although functionally they work in an identical
fashion. As in 8086, the physical memory is organized in terms of segments of 64Kbyte
maximum size.
An exception is generated if the segment size limit is exceeded by the instruction or the data. The overlapping of physical memory segments is allowed to minimize the memory requirements of a task. The 80286 reserves two fixed areas of physical memory for system initialization and the interrupt vector table. In real mode the first 1 KB of memory, from address 00000H to 003FFH, is reserved for the interrupt vector table. The addresses from FFFF0H to FFFFFH are reserved for system initialization.

The program execution starts from FFFF0H after reset and initialization. The interrupt vector
table of 80286 is organized in the same way as that of 8086. Some of the interrupt types are
reserved for exceptions, single-stepping and processor extension segment overrun, etc.

When the 80286 is reset, it always starts execution in real address mode. In real address mode, it performs the following functions: it initializes the IP and other registers of the 80286, and it prepares for entering the protected virtual address mode.

PROTECTED VIRTUAL ADDRESS MODE (PVAM)


The 80286 is the first processor to support the concepts of virtual memory and memory management. Although the virtual memory does not exist physically, it appears to be available within the system. The concept of virtual memory is implemented using physical memory, which the CPU can directly access, and secondary memory, which is used as storage for data and programs; these initially reside in secondary memory.

The segment of the program or data required for actual execution at that instant is fetched from the secondary memory into physical memory. After the execution of this fetched segment, the next segment required for further execution is fetched from the secondary memory, while the results of the executed segment are stored back into the secondary memory for further reference. This continues till the complete program is executed.

During the execution the partial results of the previously executed portions are again fetched
into the physical memory, if required for further execution. The procedure of fetching the
chosen program segments or data from the secondary storage into physical memory is called
swapping. The procedure of storing back the partial results or data back on the secondary
storage is called unswapping. The virtual memory is allotted per task.

The 80286 is able to address 1 GB (2^30 bytes) of virtual memory per task. The complete virtual memory is mapped onto the 16 MB physical memory. If a program larger than 16 MB is stored on the hard disk and is to be executed, it is fetched into the physical memory in data or program segments smaller than 16 MB, swapped in sequentially in the order of execution.

Whenever a portion of a program is required for execution by the CPU, fetching it from the secondary memory and placing it in the physical memory is called swapping in the program. Saving a portion of the program, or important partial results required for further execution, back on secondary storage to make the physical memory free for another required portion is called swapping out of the executable program.

The 80286 uses the 16-bit content of a segment register as a selector to address a descriptor stored in the physical memory. The descriptor is a block of contiguous memory locations containing information about a segment, such as segment base address, segment limit, segment type, privilege level, segment availability in physical memory, descriptor type, and whether the segment is used by another task.

PIN SIGNALS OF 80286


CLK: This is the system clock input pin. The clock frequency applied at this pin is divided by two internally and is used to derive the fundamental timing for the basic operations of the circuit. The clock is generated using the 82284 clock generator.

D15-D0: These are sixteen bidirectional data bus lines.

A23-A0: These are the physical address output lines used to address memory or I/O devices.
The address lines A23 - A16 are zero during I/O transfers

BHE: This output signal, as in 8086, indicates that there is a transfer on the higher byte of the
data bus (D15 – D8) .

S1, S0: These are the active-low status output signals which indicate the initiation of a bus cycle and, with M/IO and COD/INTA, define the type of the bus cycle.

M/IO’: This output line differentiates memory operations from I/O operations. If this signal is “0”, an I/O cycle or INTA cycle is in progress; if it is “1”, a memory or HALT cycle is in progress.

COD/ INTA’: This output signal, in combination with M/ IO signal and S1, S0 distinguishes
different memory, I/O and INTA cycles.

LOCK: This active-low output pin is used to prevent the other masters from gaining the
control of the bus for the current and the following bus cycles. This pin is activated by a
"LOCK" instruction prefix, or automatically by hardware during XCHG, interrupt
acknowledge or descriptor table access.

READY: This active-low input pin is used to insert wait states in a bus cycle, for interfacing low-speed peripherals. This signal is ignored during the HLDA cycle.

HOLD and HLDA: This pair of pins is used by external bus masters to request for the
control of the system bus (HOLD) and to check whether the main processor has granted the
control (HLDA) or not, in the same way as it was in 8086.
INTR: Through this active high input, an external device requests 80286 to suspend the
current instruction execution and serve the interrupt request. Its function is exactly similar to
that of INTR pin of 8086.

NMI: The Non-Maskable Interrupt request is an active-high, edge-triggered input that is equivalent to an interrupt of type 2. No acknowledge cycles need to be carried out.

PEREQ and PEACK (Processor Extension Request and Acknowledgement): Processor


extension refers to coprocessor (80287 in case of 80286 CPU). This pair of pins extends the
memory management and protection capabilities of 80286 to the processor extension 80287.
The PEREQ input requests the 80286 to perform a data operand transfer for a processor
extension. The PEACK active-low output indicates to the processor extension that the
requested operand is being transferred.

BUSY and ERROR: The processor extension BUSY and ERROR active-low input signals indicate the operating conditions of a processor extension to the 80286. When BUSY goes low, the 80286 suspends execution and waits until BUSY becomes inactive; during this period the processor extension is busy with its allotted job. Once the job is completed, the processor extension drives BUSY high, indicating that the 80286 may continue with program execution. An active ERROR signal causes the 80286 to perform the processor extension interrupt while executing the WAIT and ESC instructions; it indicates to the 80286 that the processor extension has committed an error.

CAP: A 0.047 μF, 12 V capacitor must be connected between this input pin and ground to filter the output of the internal substrate bias generator. For correct operation of the 80286 the capacitor must be charged to its operating voltage. Until this capacitor charges fully, the 80286 may be held in reset to avoid any spurious activity.

Vss: This pin is a system ground pin of 80286.

Vcc: This pin is used to apply +5V power supply voltage to the internal circuit of 80286.

RESET: The active-high reset input pulse width should be at least 16 clock cycles. The
80286 requires at least 38 clock cycles after the trailing edge of the RESET input signal,
before it makes the first opcode fetch cycle.

MEMORY HIERARCHY

o The memory hierarchy system consists of all storage devices employed in a computer
system.
o The goal of using a memory hierarchy is to obtain the highest possible average
access speed while minimizing the total cost of the entire memory system.
o Going down the hierarchy, the following occurs:
o Decreasing cost per bit
o Increasing capacity
o Increasing access time
o Decreasing frequency of access of the memory by the processor.

o In the memory hierarchy, the registers are at the top in terms of speed of access.
o At the next level of the hierarchy is a relatively small amount of memory that can be
implemented directly on the processor chip. This memory, called a cache, holds
copies of the instructions and data stored in a much larger memory that is provided
externally.
o The processor cache is of two or more levels:
 Level 1 (L1) cache
 Level 2 (L2) cache
 Level 3 (L3) cache
o A primary cache is always located on the processor chip. This cache is small and its access time is comparable to that of processor registers. The primary cache is referred to as the Level 1 (L1) cache.
o A larger, and slower, secondary cache is placed between the primary cache and the
rest of the memory. It is referred to as the Level2 (L2) cache.
o Some computers have a Level 3 (L3) cache of even larger size, in addition to the L1 and L2 caches. The L3 cache is also implemented in SRAM technology.
o The next level in the hierarchy is the main memory. The main memory is much larger but slower than the cache memories.
o At the bottom of the memory hierarchy are magnetic disk and tape devices. They provide a very large amount of inexpensive storage.

CACHE MEMORY

o Cache memory, also called CPU memory, is high-speed static random access memory (SRAM).
o Cache memory is responsible for speeding up computer operations and processing.
o This memory is typically integrated directly into the CPU chip or placed on a separate chip that has a separate bus interconnect with the CPU.
o The purpose of cache memory is to store program instructions and data that are
used repeatedly in the operation of programs or information that the CPU is likely to
need next.
o The CPU can access this information quickly from the cache rather than having to get
it from computer's main memory.
o Fast access to these instructions increases the overall speed of the program.

o A cache memory system includes a small amount of fast memory (SRAM) and a large amount of slow memory (DRAM). This system is configured to simulate a large amount of fast memory.
o The cache memory system consists of the following units:
 Cache – consists of static RAM (SRAM)
 Main Memory – consists of dynamic RAM (DRAM)
 Cache Controller – implements the cache logic. This controller decides which block of memory should be moved in or out of the cache.
CACHE LEVELS
o The processor cache is of two or more levels :
 Level 1 (L1) cache
 Level 2 (L2) cache
 Level 3 (L3) cache

o A primary cache is always located on the processor chip. This cache is small and its
access time is comparable to that of processor registers. The primary cache is referred
to as the Level 1 (L1) cache.

o A larger, and slower, secondary cache is placed between the primary cache and the
rest of the memory. It is referred to as the Level 2 (L2) cache.

o Some computers have a Level 3 (L3) cache of even larger size, in additionto the L1
and L2 caches.

TYPES OF CACHE
Two types of cache exist:
Unified cache: Data and instructions are stored together (Von Neumann architecture).
Split cache: Data and instructions are stored separately (Harvard architecture).

LOCALITY OF REFERENCE
o Cache memory is based on the property known as “locality of reference”.
o Locality of reference, also known as the principle of locality, is the tendency of
a processor to access the same set of memory locations repetitively over a short
period of time.
o Locality of reference occurs in two forms:
 Temporal Locality
Temporal Locality means that a recently executed instruction is likely
to be executed again very soon.
 Spatial Locality
Spatial Locality means that instructions in close proximity to a recently
executed instruction are also likely to be executed soon.
CACHE PERFORMANCE
o If a process needs some data, it first searches the cache memory.
o If the data is available in the cache, this is termed a cache hit and the data is accessed as required.
o If the data is not in the cache, it is termed a cache miss.
o The data is then obtained from the main memory.

HIT RATIO

o The performance of the cache is measured in terms of a quantity called hit ratio.
o It is the number of cache hits divided by the total number of cache accesses.

AMAT (AVERAGE MEMORY ACCESS TIME)

o Average memory access time (AMAT) is the average time to access memory
considering both hits and misses and the frequency of different accesses.
AMAT = Hit time + (Miss Rate x Miss Penalty)
where
Hit Time – time to hit in the cache
Miss Penalty – cost of a cache miss in terms of time
Miss Rate – frequency of cache misses
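For reference, here is a minimal Python sketch of this formula (our own, not from the notes); it reproduces the answers of the problems that follow:

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit time + Miss rate x Miss penalty."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.10, 10))    # 2.0 cycles (Problem 1)
print(amat(4, 0.05, 100))   # 9.0 ns    (Problem 2)
print(amat(5, 0.03, 100))   # 8.0 ns    (Problem 3)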

Problem 1:
Consider the following details: 1 cycle hit cost, 10 cycle miss penalty (11 cycles total for a miss), and a program with a 10% miss rate. Calculate the AMAT.
Solution :
AMAT = Hit time + (Miss Rate x Miss Penalty) = 1.0 + (0.1 x 10) = 2.0

Problem 2:
If a direct-mapped cache has a hit rate of 95%, a hit time of 4 ns, and a miss penalty of 100 ns, what is the AMAT?

Solution :
AMAT = Hit time + (Miss Rate x Miss Penalty) = 4 + (0.05 x 100) = 9 ns

Problem 3:
If replacing the cache with a 2-way set associative increases the hit rate to 97%, but
increases the hit time to 5 ns, what is the new AMAT?
Solution :
AMAT = Hit time + (Miss Rate x Miss Penalty) = 5 + (0.03 x 100) = 8 ns

Problem 4 :
Suppose that in 1000 memory references there are 40 misses in the L1 cache and 10 misses in the L2 cache. The miss penalty of L2 is 200 clock cycles, the hit time of L1 is 1 clock cycle, and the hit time of L2 is 15 clock cycles. What will be the average memory access time?
Solution :
L1 miss rate = 40 / 1000 = 0.04
Local L2 miss rate = 10 / 40 = 0.25
AMAT = L1 hit time + L1 miss rate x (L2 hit time + local L2 miss rate x L2 miss penalty)
= 1 + 0.04 x (15 + 0.25 x 200) = 1 + 0.04 x 65 = 3.6 cycles
Problem 5:
The application program in a computer system with cache uses 1400 instruction acquisition bus cycles from cache memory and 100 from main memory. What is the hit rate? If the cache memory operates with zero wait states and the main memory bus cycles use three wait states, what is the average number of wait states experienced during the program execution?
Solution :
Hit rate = 1400 / (1400 + 100) = 0.933, i.e. about 93.3%.
Average wait states = (1400 x 0 + 100 x 3) / 1500 = 0.2 wait states per bus cycle.
MEASURING AND IMPROVING CACHE PERFORMANCE

MEASURING CACHE PERFORMANCE

CPU time can be divided into the clock cycles that the CPU spends executing the
program and the clock cycles that the CPU spends waiting for the memory system.

Memory-Stall Cycles
Memory-stall clock cycles can be defined as the sum of the stall cycles coming from reads plus those coming from writes.

Read-Stall Cycles
The read-stall cycles can be defined in terms of the number of read accesses per program,
the miss penalty in clock cycles for a read, and the read miss rate.

Write-Stall Cycles
For a write-through scheme, we have two sources of stalls: write misses, which usually require that we fetch the block before continuing the write, and write-buffer stalls, which occur when the write buffer is full when a write occurs. Thus, the cycles stalled for writes equal the sum of these two.

Combining Read /Write Stall Cycles

Read and write stall cycles can be combined by using a single miss rate and miss penalty (the write and read miss penalties are the same, i.e. the time to fetch a block from main memory). The combined stalls can then be written as:

Memory-stall clock cycles = (Memory accesses / Program) x Miss rate x Miss penalty
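As a sketch of how these formulas are applied (the function and parameter names below are our own), the following Python function computes the effective CPI including memory stalls; the numbers match Problem 1 below:

def effective_cpi(base_cpi, i_miss_rate, d_miss_rate, ls_fraction, miss_penalty):
    """Base CPI plus memory-stall cycles per instruction:
    instruction-fetch misses plus data misses on loads/stores,
    each costing miss_penalty cycles."""
    i_stalls = i_miss_rate * miss_penalty
    d_stalls = ls_fraction * d_miss_rate * miss_penalty
    return base_cpi + i_stalls + d_stalls

cpi = effective_cpi(2.0, 0.02, 0.04, 0.36, 40)
print(cpi)         # 3.376
print(cpi / 2.0)   # 1.688 -> a perfect cache would be ~1.69x faster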

IMPROVING CACHE PERFORMANCE


There are two different techniques for improving cache performance. They are
1. Reducing the miss rate by more flexible block placement
 Set-associative cache
 Fully associative cache
2. Reducing the miss penalty by adding a level of cache
 Multilevel caching

SOLVED PROBLEMS - CACHE PERFORMANCE

Problem 1:
Assume the miss rate of an instruction cache is 2% and the miss rate of the data cache is
4%. If a processor has a CPI of 2 without any memory stalls and the miss penalty is 40
cycles for all misses, determine how much faster a processor would run with a perfect
cache that never missed. Assume the frequency of all loads and stores is 36%.
Solution :
Instruction miss cycles = 2% x 40 = 0.80 cycles per instruction
Data miss cycles = 36% x 4% x 40 = 0.576 cycles per instruction
CPI with memory stalls = 2 + 0.80 + 0.576 = 3.376
Speedup with a perfect cache = 3.376 / 2 = 1.69, so the perfect-cache processor is about 1.69 times faster.
Problem 2 :
Suppose that the clock rate of the machine used in the previous example is doubled but the absolute memory speed, cache misses, and miss rates stay the same. How much faster will the machine be with the faster clock?
Solution:
With the clock rate doubled, the miss penalty becomes 2 x 40 = 80 clock cycles.
Total miss cycles per instruction = (2% x 80) + 36% x (4% x 80) = 1.6 + 1.152 = 2.752
CPI with memory stalls = 2 + 2.752 = 4.752
Performance ratio = (3.376 x 2) / 4.752 = 1.42, so the machine with the faster clock is only about 1.4 times faster, not twice as fast.
Problem 3 :
Suppose we have a 500 MHz processor with a base CPI of 1.0 with no cache misses. Assume the memory access time is 200 ns and the average cache miss rate is 5%. Compare performance after adding a second-level cache, with an access time of 20 ns, that reduces the miss rate to main memory to 2%.

Solution:
Clock cycle time = 1 / 500 MHz = 2 ns
Miss penalty to main memory = 200 ns / 2 ns = 100 cycles
CPI with L1 only = 1.0 + 0.05 x 100 = 6.0
L2 access penalty = 20 ns / 2 ns = 10 cycles
CPI with L1 and L2 = 1.0 + 0.05 x 10 + 0.02 x 100 = 3.5
Speedup from the second-level cache = 6.0 / 3.5 = 1.7

CACHE ARCHITECTURE / POLICY

o Caches have two architectural characteristics: read architecture and write architecture.

 The read architecture may be either “Look Aside” or “Look Through.”
 The write architecture may be either “Write-Back” or “Write-Through.”
o Read Architecture :
 Look Aside - The CPU requests memory from the cache and the main memory simultaneously. If the data is in the cache, it is returned from there; otherwise the CPU waits for the data from the main memory.
 Look Through - The CPU requests memory from the cache. Only if the data is not present in the cache is the main memory queried.
o Write Architecture :
 Write Back - Write operations update only the cache; a modified (dirty) block is written back to main memory when it is replaced.
 Write Through - When data is written, it is written to the cache and to the main memory at the same time.
ELEMENTS OF CACHE DESIGN

CACHE OPERATIONS

o The processor generates the read address (RA) of a word to be read.


o If the word is contained in the cache, it is delivered to the processor.
o Otherwise, the block containing that word is loaded into the cache, and the word is
delivered to the processor.
o When a cache hit occurs, the data and address buffers are disabled and
communication is only between processor and cache, with no system bus traffic.
o When a cache miss occurs, the desired address is loaded onto the system bus and the
data are returned through the data buffer to both the cache and the processor.
[Figures: cache read operation and cache write operation]

CACHE MAPPING FUNCTIONS


o Cache mapping defines how a block from the main memory is mapped to the cache
memory in case of a cache miss.
OR
o Cache mapping is a technique by which the contents of main memory are brought into
the cache memory.
o The correspondence between the main memory blocks and cache is specified by a
“Mapping Function”.
o When a processor issues a Read request, a block of words is transferred from the
main memory to the cache, one word at a time.
o When the program references any of the locations in the block, the desired contents are read directly from the cache.
o Mapping functions determine how memory blocks are placed in the cache.

Notes:
 Main memory is divided into equal-size partitions called blocks or frames.
 Cache memory is divided into partitions of the same size as the blocks, called lines.
During cache mapping, a block of main memory is simply copied to the cache; the block is not actually removed from the main memory.

The three mapping functions:


o Direct Mapping
o Fully Associative Mapping
o K-Way Set-Associative Mapping

(1) Direct Mapping:-


 In the case of direct mapping, a certain block of the main memory is able to map only to a particular line of the cache.
 The cache line number to which a distinct block can map is given by the following:

Cache line number = (Main memory block address) modulo (Total number of lines in cache)

or, as an expression,

j = i mod n

where j = cache line number
i = main memory block number
n = total number of lines in the cache

For example,

 Let us consider a cache memory that is divided into a total of ‘n’ lines.
 Then block ‘j’ of the main memory is able to map to line number (j mod n) of the cache.

Need of Replacement Algorithm:


In direct mapping,
 There is no need of any replacement algorithm.
 This is because a main memory block can map only to a particular line of the cache.
 Thus, the new incoming block will always replace the existing block (if any) in that
particular line.
Division of Physical Address:
In direct mapping, the physical address is divided into three fields: Tag | Line Number | Word (block offset).
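As an illustrative sketch (ours, not from the notes), the fields can be extracted in Python; the bit widths below match the 2K-word cache example later in this section:

def split_address(addr, word_bits, line_bits):
    """Split a physical address into (tag, line, word) fields
    for a direct-mapped cache."""
    word = addr & ((1 << word_bits) - 1)
    line = (addr >> word_bits) & ((1 << line_bits) - 1)
    tag = addr >> (word_bits + line_bits)
    return tag, line, word

# 16-bit address, 16-word blocks (4 word bits), 128 lines (7 line bits),
# leaving 5 tag bits, as in the worked example below.
print(split_address(0b10110_0000011_1010, 4, 7))   # (22, 3, 10)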

(2) Fully Associative Mapping:-
In fully associative mapping,
 A block of main memory can map to any line of the cache that is freely available at that moment.
 This makes fully associative mapping more flexible than direct mapping.
 All the lines of the cache are freely available.
 Thus, any block of main memory can map to any line of the cache.
 Had all the cache lines been occupied, one of the existing blocks would have to be replaced.

Need of Replacement Algorithm:


In fully associative mapping,
 A replacement algorithm is required.
 The replacement algorithm suggests the block to be replaced if all the cache lines are occupied.
 Thus, a replacement algorithm such as the FCFS algorithm or LRU algorithm is employed.
Division of Physical Address:
In fully associative mapping, the physical address is divided into two fields: Tag | Word (block offset).

(3) K-Way Set-Associative Mapping:-

In k-way set associative mapping,

 Cache lines are grouped into sets, where each set contains k lines.
 A particular block of main memory can map to only one particular set of the cache.
 However, within that set, the memory block can map to any cache line that is freely available.
 The set of the cache to which a particular block of the main memory can map is given by:

Cache set number = (Main memory block address) modulo (Number of sets in cache)
Consider the following example of 2-way set associative mapping-

Here,

k = 2 suggests that each set contains two cache lines.


 Since the cache contains 6 lines, the number of sets in the cache = 6 / 2 = 3 sets.
 Block ‘j’ of main memory can map only to set number (j mod 3) of the cache.
 Within that set, block ‘j’ can map to any cache line that is freely available at that moment.
 If all the cache lines are occupied, then one of the existing blocks will have to be
replaced.
Need of Replacement Algorithm-
o Set associative mapping is a combination of direct mapping and fully associative
mapping.
o It uses fully associative mapping within each set.
o Thus, set associative mapping requires a replacement algorithm.
Division of Physical Address-
In set associative mapping, the physical address is divided into three fields: Tag | Set Number | Word (block offset).
Example:
Consider a cache consisting of 128 blocks of 16 words each, for a total of 2048 (2K) words, and assume that the main memory is addressable by a 16-bit address. The main memory is 64K words, which will be viewed as 4K blocks of 16 words each.
Direct Mapping:
 The simplest way to determine cache locations in which to store memory blocks is the direct mapping technique.
 In this technique, block J of the main memory maps onto block J modulo 128 of the cache.
 Thus main memory blocks 0, 128, 256, … are loaded into the cache and stored at cache block 0.
 Blocks 1, 129, 257, … are stored at cache block 1, and so on.
 Placement of a block in the cache is determined from the memory address.
 The memory address is divided into 3 fields: the lower 4 bits select one of the 16 words in a block.
 When a new block enters the cache, the 7-bit cache block field determines the cache position in which this block must be stored.
 The higher-order 5 bits of the memory address of the block are stored in the 5 tag bits associated with its location in the cache.
 They identify which of the 32 blocks that are mapped into this cache position is currently resident in the cache.
 It is easy to implement, but not flexible.

Fully Associative Mapping:


 This is a more flexible mapping method, in which a main memory block can be placed into any cache block position.
 In this case, 12 tag bits are required to identify a memory block when it is resident in the cache.
 The tag bits of an address received from the processor are compared to the tag bits of each block of the cache to see if the desired block is present. This is known as the associative mapping technique.
 The cost of an associative-mapped cache is higher than the cost of a direct-mapped one because of the need to search all 128 tag patterns to determine whether a block is in the cache. This is known as an associative search.

2-Way Set-Associative Mapping:

 It is a combination of the direct and associative mapping techniques.
 Cache blocks are grouped into sets, and the mapping allows a block of main memory to reside in any block of a specific set.
 For a cache with two blocks per set, memory blocks 0, 64, 128, …, 4032 map into cache set 0, and they can occupy either of the two blocks within this set.
 Having 64 sets means that the 6-bit set field of the address determines which set of the cache might contain the desired block.
 The tag bits of the address must be associatively compared to the tags of the two blocks of the set to check if the desired block is present. This is a two-way associative search.
COMPARISON BETWEEN MAPPING TECHNIQUES

SOLVED PROBLEMS - CACHE MAPPING FUNCTIONS

Problem 1:
Consider a 64-byte cache with 8 byte blocks, an associativity of 2 and LRU block replacement.
Virtual addresses are 16 bits. The cache is physically tagged. The processor has 16KB of
physical memory. What is the total number of tag bits?

Solution :
 The cache is 64-bytes with 8-byte blocks, so there are 8 blocks.
 The associativity is 2, so there are 4 sets.
 Since there are 16KB of physical memory, a physical address is 14 bits long.
 Of these, 3 bits are taken for the offset (8-byte blocks), and 2 for the index (4 sets). That leaves 9 tag bits per block.
 Since there are 8 blocks, that makes 72 tag bits.

Problem 2:
A block-set-associative cache memory consists of 128 blocks divided into sets of four blocks each. The main memory consists of 16,384 blocks and each block contains 256 eight-bit words.
1. How many bits are required for addressing the main memory?
2. How many bits are needed to represent the TAG, SET and WORD fields?

Solution
Given:

Number of blocks in cache memory = 128
Number of blocks in each set of cache = 4
Main memory size = 16384 blocks
Block size = 256 bytes
1 word = 8 bits = 1 byte

Main Memory Size


Size of main memory = 16384 blocks = 16384 x 256 bytes = 2^22 bytes
Thus, Number of bits required to address main memory = 22 bits

Number of Bits in Block Offset


Block size = 256 bytes = 2^8 bytes
Thus, Number of bits in block offset or word = 8 bits

Number of Bits in Set Number


Number of sets in cache = Number of lines in cache / Set size
= 128 blocks / 4 blocks = 32 sets = 2^5 sets
Thus, Number of bits in set number = 5 bits

Number of Bits in Tag Number


Number of bits in tag = Number of bits in physical address – (Number of bits in set number + Number of bits in word)
= 22 bits – (5 bits + 8 bits) = 22 bits – 13 bits = 9 bits
Thus, Number of bits in tag = 9 bits.
Problem 3:
Consider a direct mapped cache with 8 cache blocks (0-7). The memory block requests arrive in the order: 3, 5, 2, 8, 0, 6, 3, 9, 16, 20, 17, 25, 18, 30, 24, 2, 63, 5, 82, 17, 24.
Which of the following memory blocks will not be in the cache at the end of the sequence: 3, 18, 20 or 30? Also, calculate the hit ratio and miss ratio.
Solution
There are 8 blocks in cache memory, numbered 0 to 7. In direct mapping, a particular block of main memory is mapped to a particular line of cache memory. The line number is given by:

Cache line number = Block address modulo Number of lines in cache

For the given sequence:

 Requests for memory blocks are generated one by one.
 The line number of each block is calculated using the above relation.
 Then the block is placed in that line.
 If another block already occupies that line, it is replaced.

At the end of the sequence, lines 0 to 7 hold blocks 24, 17, 82, 3, 20, 5, 30 and 63, so of the given blocks only block 18 is not in the cache.

Hit ratio = 3 / 21
Miss ratio = 18 / 21
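The result can be checked with a short Python simulation of the direct-mapped cache (our own sketch, not from the notes):

cache = [None] * 8          # 8 direct-mapped lines, all empty
hits = 0
requests = [3, 5, 2, 8, 0, 6, 3, 9, 16, 20, 17, 25, 18, 30, 24,
            2, 63, 5, 82, 17, 24]
for block in requests:
    line = block % 8        # direct mapping: line = block mod 8
    if cache[line] == block:
        hits += 1           # cache hit
    else:
        cache[line] = block # miss: replace whatever occupies the line
print(hits, len(requests) - hits)   # 3 hits, 18 misses
print(cache)                # [24, 17, 82, 3, 20, 5, 30, 63]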

Problem 4:
Consider a fully associative cache with 8 cache blocks (0-7). The memory block requests arrive in the order:

4, 3, 25, 8, 19, 6, 25, 8, 16, 35, 45, 22, 8, 3, 16, 25, 7

If the LRU replacement policy is used, which cache block will hold memory block 7? Also, calculate the hit ratio and miss ratio.

Solution
There are 8 blocks in cache memory, numbered 0 to 7. In fully associative mapping, any block of main memory can be mapped to any line of the cache that is freely available. If all the cache lines are already occupied, then a block is replaced in accordance with the replacement policy.

Thus,
Line 5 contains block 7.
Hit ratio = 5 / 17
Miss ratio = 12 / 17
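Again, a short Python sketch (ours) verifies the hit and miss counts; it tracks which blocks are resident and their recency rather than the physical line numbers:

from collections import OrderedDict

cache = OrderedDict()       # least recently used entry kept first
hits = 0
requests = [4, 3, 25, 8, 19, 6, 25, 8, 16, 35, 45, 22, 8, 3, 16, 25, 7]
for block in requests:
    if block in cache:
        hits += 1
        cache.move_to_end(block)       # refresh recency on a hit
    else:
        if len(cache) == 8:            # all 8 lines occupied
            cache.popitem(last=False)  # evict the least recently used
        cache[block] = True
print(hits, len(requests) - hits)      # 5 hits, 12 misses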
VIRTUAL MEMORY
o Virtual memory is an architectural solution to increase the effective size of the
memory system.
o Virtual memory is a memory management technique that allows the execution of
processes that are not completely in memory.
o In some cases during the execution of the program the entire program may not be
needed.
o Virtual memory allows files and memory to be shared by two or more processes
through page sharing.
o The techniques that automatically move program and data between main memory and secondary storage when they are required for execution are called virtual-memory techniques.

ADVANTAGES

o One major advantage of this scheme is that programs can be larger than physical memory.
o Virtual memory also allows processes to share files easily and to implement shared memory.
o It increases processor utilization and throughput.
o Less I/O is needed to load or swap user programs into memory.

LOGICAL and PHYSICAL ADDRESS SPACE


o An address generated by the processor is commonly referred to as a logical address, which is also called a virtual address.
o The set of all logical addresses generated by a program is a logical address space.
o An address seen by the memory unit—that is, the one loaded into the memory-address register of the memory—is commonly referred to as a physical address.
o The set of all physical addresses is a physical address space or memory space.

MEMORY MANAGEMENT UNIT

 The mapping from virtual to physical addresses is done by a device called the memory-management unit (MMU).
 The MMU is a hardware device that maps virtual addresses to physical addresses at run time (also called address translation hardware).
VIRTUAL TO PHYSICAL ADDRESS TRANSLATION
Each virtual address generated by the processor contains a virtual page number and an offset. Each physical address contains a page frame number and an offset.

Virtual to physical address translation involves two phases:
1. Segment Translation
2. Page Translation

Segment Translation
 Segment translation is the process of converting a logical address (virtual address) into a linear address.
 A logical address consists of a Selector and an Offset.
 A selector is the content of a segment register. The selector is used to point to a descriptor for the segment in a table of descriptors.
 Every selector has a linear base address associated with it, and it is stored in the segment descriptor.
 The linear base address is then added to the offset to generate the linear address.
 If paging is not enabled, then the linear address corresponds to the physical address.
 If paging is enabled, then page translation is performed, which translates the linear address into a physical address.
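The following Python sketch (ours; the descriptor-table contents are hypothetical) illustrates segment translation with a limit check:

# Hypothetical descriptor table: selector -> (linear base address, limit)
descriptors = {0x08: (0x100000, 0xFFFF)}

def segment_translate(selector, offset):
    """Logical address (selector:offset) -> linear address."""
    base, limit = descriptors[selector]
    if offset > limit:
        raise MemoryError("segment limit exceeded -> exception")
    return base + offset       # linear = base + offset

print(hex(segment_translate(0x08, 0x1234)))   # 0x101234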
Page Translation
Page translation is the process of converting a linear address into a physical address. When paging is enabled, the linear address is broken into a virtual page number and a page offset.

Page Table - The page table contains information about the main memory address where the page is stored and the current status of the page.

Page Frame - An area in the main memory that holds one page.

Page Table Base Register - It contains the starting address of the page table in main memory.

Control Bits in Page Table - The control bits specify the status of the page while it is in main memory. There are two control bits:
(i) Valid Bit – The valid bit indicates the validity of the page. If the bit is 0, the page is not present in main memory and a page fault occurs. If the bit is 1, the page is in memory and the entry contains the physical page number.
(ii) Dirty Bit – The dirty bit is set when any word in the page is written.
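A minimal Python sketch of page translation using the valid bit (ours; the page size and table contents are assumed for illustration):

PAGE_SIZE = 4096    # assume 4 KB pages for this sketch

# Page table: virtual page number -> (valid bit, dirty bit, frame number)
page_table = {0: (1, 0, 5), 1: (0, 0, None), 2: (1, 1, 9)}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # split into page number + offset
    valid, dirty, frame = page_table[vpn]
    if not valid:
        raise MemoryError("page fault: page %d not in main memory" % vpn)
    return frame * PAGE_SIZE + offset        # physical = frame base + offset

print(translate(2 * PAGE_SIZE + 100))        # frame 9 -> 36964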

TLB (TRANSLATION LOOKASIDE BUFFER)


o The Page table information is used by MMU for every read & write access.
o The Page table is placed in the main memory but a copy of the small portion of the
page table is located within MMU. This small portion is called Translation Look
Aside Buffer (TLB).
o This portion consists of the page table entries that correspond to the most recently accessed pages, together with the virtual address of each entry.
o When the operating system changes the contents of page table, the control bit in TLB
will invalidate the corresponding entry in the TLB.
o Given a virtual address, the MMU looks in TLB for the referenced page.
o If the page table entry for this page is found in TLB, the physical address is
obtained immediately.
o If there is a miss in TLB, then the required entry is obtained from the page table in the
main memory & TLB is updated.
o When a program generates an access request to a page that is not in the main
memory, then Page Fault will occur.
o The whole page must be brought from disk into memory before an access can
proceed.
o When it detects a page fault, the MMU asks the processor to generate an interrupt.
o The processor suspends the execution of the task that caused the page fault and begins execution of another task whose pages are in main memory, because a long delay occurs while the page transfer takes place.
o When the task resumes, either the interrupted instruction must continue from the
point of interruption or the instruction must be restarted.
o If a new page is brought from the disk when the main memory is full, it must replace
one of the resident pages.
o A modified page has to be written back to the disk before it is removed from the main
memory.
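The TLB lookup order described above — try the TLB first, fall back to the page table on a miss, and raise a page fault for an invalid entry — can be sketched in Python (ours, with assumed structures):

PAGE_SIZE = 4096
page_table = {0: 5, 2: 9}    # valid entries only: vpn -> frame
tlb = {}                     # recently used translations: vpn -> frame

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in tlb:                     # TLB miss: walk the page table
        if vpn not in page_table:
            raise MemoryError("page fault")
        tlb[vpn] = page_table[vpn]         # update the TLB
    return tlb[vpn] * PAGE_SIZE + offset   # TLB hit on later accesses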

PAGE FAULT & DEMAND PAGING


o If the page required by the processor is not in the main memory, a page fault occurs.
o The required page is loaded into the main memory from the secondary storage by a special routine called the page fault routine.
o This technique of bringing the desired page into the main memory is called demand paging.
o Demand paging is the process of loading the pages only when they are
demanded by the process during execution. Pages that are never
accessed are thus never loaded into physical memory.
Steps in handling a Page Fault
1. The internal table is first checked to determine whether the reference was a valid or an invalid memory access.
2. If the reference was invalid, the process is terminated. Otherwise, the page must be paged in.
3. A free frame is located, possibly from a free-frame list.
4. A disk operation is scheduled to bring in the necessary page from disk.
5. When the I/O operation is complete, the page table is updated with the new frame number, and the invalid bit is changed to indicate that this is now a valid page reference.
6. The instruction that caused the page fault must now be restarted from the beginning.

PAGE REPLACEMENT ALGORITHMS


o If a page requested by a process is in memory, then the process can access it. If the requested page is not in main memory, a page fault occurs.
o When there is a page fault, the processor loads the page from the secondary memory into the main memory. It looks for a free frame. If there is no free frame, a page that is not currently in use is swapped out of the main memory, and the desired page is swapped in.
o The process of swapping a page out of main memory to the swap space and swapping the desired page into the main memory for execution is called page replacement.

STEPS IN PAGE REPLACEMENT


1. Find the location of the desired page on the disk.
2. Find a free frame:
a. If there is a free frame, use it.
b. If there is no free frame, use a page-replacement algorithm to select a victim frame.
c. Write the victim frame to the disk; change the page and frame tables accordingly.
3. Read the desired page into the newly freed frame; change the page and frame tables.
4. Continue the user process from where the page fault occurred.
PAGE REPLACEMENT ALGORITHMS
1. FIFO page Replacement
2. Optimal Page Replacement
3. LRU Page Replacement
4. Counting Based Page Replacement Algorithm

1. FIFO PAGE REPLACEMENT


o The simplest page-replacement algorithm is a first-in, first-out (FIFO) algorithm.
o A FIFO replacement algorithm replaces the oldest page that was brought into main
memory.

EXAMPLE:

o Consider the reference string 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 for a memory with three frames.
o The three frames are empty initially.
o The first three references (7, 0, 1) cause page faults and are brought into these empty frames.
o The next reference (2) replaces page 7, because page 7 was brought in first.
o Page 0 is the next reference; since 0 is already in memory, there is no fault for this reference.
o The first reference to 3 results in the replacement of page 0, since it is now first in line.
o Because of this replacement, the next reference, to 0, will fault. Page 1 is then replaced by page 0. The process continues until all the pages are referenced.

Advantages:
o The FIFO page-replacement algorithm is easy to understand and program.

Disadvantages:
o The performance is not always good.
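The FIFO example above can be verified with a few lines of Python (our own sketch):

from collections import deque

frames = deque(maxlen=3)    # 3 frames; append drops the oldest when full
faults = 0
refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
for page in refs:
    if page not in frames:
        faults += 1         # page fault: the oldest page is evicted
        frames.append(page)
print(faults)               # 15 page faults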
2. OPTIMAL PAGE REPLACEMENT
o The optimal page replacement algorithm replaces the page that will not be used for the longest period of time.

EXAMPLE:

o Consider the reference string 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 for a memory with three frames.
o The first three references cause faults that fill the three empty frames.
o The reference to page 2 replaces page 7, because page 7 will not be used until reference 18, whereas page 0 will be used at 5, and page 1 at 14.
o The reference to page 3 replaces page 1, as page 1 will be the last of the three pages in memory to be referenced again.

Advantage:
o Optimal replacement is much better than a FIFO algorithm

Disadvantage:
o The optimal page-replacement algorithm is difficult to implement, because it
requires future knowledge of the reference string.

3. LRU PAGE REPLACEMENT


o The Least Recently Used algorithm replaces the page that has not been used for the longest period of time.
o LRU replacement associates with each page the time of that page’s last use.
o It is similar to optimal page replacement looking backward in time, rather than forward.

EXAMPLE:

o Consider the reference string 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 for a memory with three frames.
o The first three references cause faults that fill the three empty frames.
o The reference to page 2 replaces page 7, because page 7 has not been used for the longest period of time, looking backward.
o The reference to page 3 replaces page 1, because page 1 has not been used for the longest period of time.
o When the reference to page 4 occurs, however, LRU replacement sees that, of the three frames in memory, page 2 was used least recently.
o Thus, the LRU algorithm replaces page 2, not knowing that page 2 is about to be used.
o When it then faults for page 2, the LRU algorithm replaces page 3, since it is now the least recently used of the three pages in memory.

Advantage:
o The LRU policy is often used as a page-replacement algorithm and is
considered to be good.
o LRU replacement does not suffer from Belady’s anomaly.

Disadvantage:
o The problem is to determine an order for the frames defined by the time of last use.
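The LRU example can likewise be checked with a short Python sketch (ours):

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
frames = []                 # most recently used page kept at the end
faults = 0
for page in refs:
    if page in frames:
        frames.remove(page) # hit: refresh its recency
    else:
        faults += 1
        if len(frames) == 3:
            frames.pop(0)   # evict the least recently used page
    frames.append(page)
print(faults)               # 12 page faults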

4.COUNTING-BASED PAGE REPLACEMENT


o We can keep a counter of the number of references that have been made to each page.
o This method includes two schemes:
 Least Frequently Used (LFU) page replacement: The LFU page-replacement algorithm requires that the page with the smallest count be replaced. The reason for this selection is that an actively used page should have a large reference count.
 Most Frequently Used (MFU) page replacement: The MFU page-replacement algorithm is based on the argument that the page with the smallest count was probably just brought in and has yet to be used.
SOLVED PROBLEMS – VIRTUAL MEMORY

Problem 1:
A computer system has a 36-bit virtual address space with a page size of 8K and 4 bytes per page table entry.

1. How many pages are in the virtual address space?

2. What is the maximum size of addressable physical memory in this system?
Solution
1. A 36-bit address can address 2^36 bytes in a byte-addressable machine. Since the size of a page is 8K bytes (2^13), the number of addressable pages is 2^36 / 2^13 = 2^23.
2. With 4-byte entries in the page table we can reference 2^32 pages. Since each page is 2^13 B long, the maximum addressable physical memory size is
2^32 x 2^13 = 2^45 B
PAGING:
Paging is the memory management technique in which secondary memory is divided into fixed-size blocks called pages, and main memory is divided into fixed-size blocks called frames. A frame has the same size as a page. The processes are initially in secondary memory, from where they are brought into main memory (RAM) when required. Each process is divided into parts whose size is the same as the page size; each page of a process is stored in one of the memory frames. Paging uses non-contiguous memory allocation: the pages of a process in the main memory can be stored at different locations in the memory.
Advantages of paging:

o Pages reduce external fragmentation.


o Simple to implement.
o Memory efficient.
o Due to the equal size of frames, swapping becomes very easy.
o It is used for faster access of data.

SEGMENTATION:
Segmentation is another memory management technique used by operating systems. The
process is divided into segments of different sizes and then put in the main memory. The
program/process is divided into modules, unlike paging, in which the process was divided
into fixed-size pages or frames. The corresponding segments are loaded into the main
memory when the process is executed. Segments contain the program’s utility functions,
main function, subroutines, stack, array and so on.

Compaction:

 Compaction is a memory management technique in which the free space of a running system is compacted to reduce the fragmentation problem and improve memory allocation efficiency. Compaction is used by many modern operating systems, such as Windows, Linux, and Mac OS X.

 As shown in the figure, there is some used memory (black) and some unused memory (white). The used memory is brought together and all the empty spaces are combined. This process is called compaction.

 This is done to solve the problem of fragmentation, but it requires a great deal of CPU time.
 By compacting memory, the operating system can reduce or eliminate fragmentation and make it easier for programs to allocate and use memory.

The compaction process usually consists of two steps:

1. Moving all pages that are in use together into one contiguous region of memory.
2. Combining the space that is freed into a single large block available for new allocations.

PIPELINING

 Pipelining (or instruction pipelining) is an implementation technique in which multiple instructions are overlapped in execution.
 Pipelining is an arrangement of the hardware elements of the CPU such that its overall performance is increased.
 The computer pipeline is divided into stages.
 The stages are connected to one another. Each stage completes a part of an instruction in parallel.
 Pipelining is widely used in modern processors.
 Pipelining is a particularly effective way of organizing concurrent activity in a computer system.
 It uses faster circuit technology to build the processor and the main memory.
Advantages :
 Pipelining is a key to making processing fast.
 Pipelining improves system performance in terms of throughput.
 Pipelining makes the system reliable.
Disadvantages:
1. The design of a pipelined processor is complex and costly to manufacture.
2. The instruction latency is higher.
DIFFERENCE BETWEEN SEQUENTIAL EXECUTION AND PIPELINED
EXECUTION

In sequential execution, the processor executes a program by fetching and executing instructions one after another. In pipelined execution, the processor executes a program by overlapping the instructions.

PIPELINED EXECUTION / ORGANIZATION


2 - STAGE PIPELINED EXECUTION
 Execution of a program consists of a sequence of fetch and execute steps.
 Let Fi and Ei refer to the fetch and execute steps for instruction Ii.
 A computer has two separate hardware units.
 They are:
 Instruction fetch unit
 Instruction execution unit
 The instruction fetched by the fetch unit is stored in an intermediate storage buffer.
 This buffer is needed to enable the execution unit to execute the instruction while the fetch unit is fetching the next instruction.
 The execution results are stored in the destination location specified by the instruction.
 The fetch and execute steps of any instruction can each be completed in one cycle.
3 - STAGE PIPELINED EXECUTION

 The stages are:


F - Fetch : Read the instruction from the memory
D - Decode : Decode the instruction and fetch the source operand(s)
E - Execute : Perform the operation specified by the instruction

4 - STAGE PIPELINED EXECUTION

 The stages are:


F - Fetch : Read the instruction from the memory
D - Decode : Decode the instruction and fetch the source operand(s)
E - Execute : Perform the operation specified by the instruction
W - Write : Store the result in the destination location
5 - STAGE PIPELINED EXECUTION

 Instruction Fetch - The CPU reads the instruction from the memory address whose value is present in the program counter.
 Instruction Decode - The instruction is decoded and the register file is accessed to get the values of the registers used in the instruction.
 Execute - ALU operations are performed.
 Memory Access - Memory operands are read from or written to the memory address present in the instruction.
 Write Back – The computed value is written back to the register file.
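Under the ideal assumption of one cycle per stage and no stalls, the total time for n instructions on a k-stage pipeline is k + (n - 1) cycles; a short Python check (ours, not from the notes):

def pipeline_cycles(n_instructions, n_stages):
    """Ideal k-stage pipeline with no stalls: k + (n - 1) cycles."""
    return n_stages + (n_instructions - 1)

print(pipeline_cycles(100, 5))   # 104 cycles, versus 500 sequentially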

6 - STAGE PIPELINED EXECUTION


PIPELINE HAZARDS

A condition that causes a pipeline to stall (delay) is called a hazard.

Hazards are problems with the instruction pipeline in the CPU that arise when the next instruction cannot execute in the following clock cycle.

 Hazards are categorized into three types:

1. Structural Hazard – The situation when two instructions require the use of a
given hardware resource at the same time.

2. Data Hazard – Any condition in which either the source or the destination
operands of an instruction are not available at the time expected in the
pipeline. So some operation has to be delayed, and the pipeline stalls.

3. Instruction Hazard – A delay in the availability of an instruction causes the
pipeline to stall. This type of hazard occurs when the pipeline makes the
wrong decision on a branch prediction and therefore brings instructions into
the pipeline that must subsequently be discarded.

STRUCTURAL HAZARD

 A structural hazard occurs when two or more instructions that are already in the pipeline need the same resource.
 These hazards are because of conflicts due to insufficient resources.
 The result is that the instructions must be executed in series rather than in parallel for a portion of the pipeline.
 Structural hazards are sometimes referred to as resource hazards.
 Example:
 A situation in which multiple instructions are ready to enter the execute
instruction phase and there is a single ALU (Arithmetic Logic Unit).
 One solution to such a resource hazard is to increase the available resources, such as having multiple ALUs.
DATA HAZARD

A data hazard occurs when there is a conflict in the access of an operand location. There
are three types of data hazards. They are
Read After Write (RAW) or True Dependency:
 An instruction modifies a register or memory location and a succeeding instruction reads the data in that memory or register location.
 A RAW hazard occurs if the read takes place before the write operation is
complete.
 Example
I1 : R2 ← R5 + R3
I2 : R4 ← R2 + R3

Write After Read (WAR) or Anti Dependency:


 An instruction reads a register or memory location and a succeeding instruction writes to the location.
 A WAR hazard occurs if the write operation completes before the read operation takes place.
 Example
I1 : R4 ← R1 + R5

I2 : R5 ← R1 + R2

Write After Write (WAW) or Output Dependency:


 Two instructions both write to the same location.
 A WAW hazard occurs if the write operations take place in the reverse order of the intended sequence.
 Example:
I1 : R2 ← R4 + R7

I2 : R2 ← R1 + R3
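The three dependency types can be detected mechanically by comparing the registers that two instructions read and write. The C sketch below (an illustration added to these notes; the one-destination/two-source instruction record is a simplification) classifies the dependencies between an earlier instruction I1 and a later instruction I2.

    #include <stdio.h>

    typedef struct {
        int dest;        /* register written by the instruction */
        int src1, src2;  /* registers read by the instruction   */
    } Insn;

    /* Report the data hazards between i1 (earlier) and i2 (later). */
    void classify(Insn i1, Insn i2) {
        if (i2.src1 == i1.dest || i2.src2 == i1.dest)
            printf("RAW: I2 reads R%d, which I1 writes\n", i1.dest);
        if (i2.dest == i1.src1 || i2.dest == i1.src2)
            printf("WAR: I2 writes R%d, which I1 reads\n", i2.dest);
        if (i2.dest == i1.dest)
            printf("WAW: both instructions write R%d\n", i1.dest);
    }

    int main(void) {
        Insn i1 = { 2, 5, 3 };  /* I1 : R2 <- R5 + R3 */
        Insn i2 = { 4, 2, 3 };  /* I2 : R4 <- R2 + R3 */
        classify(i1, i2);       /* prints the RAW case shown above */
        return 0;
    }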

INSTRUCTION / CONTROL / BRANCH HAZARD


 An instruction (or) control (or) branch hazard occurs when the pipeline makes the wrong decision on a branch prediction and therefore brings instructions into the pipeline that must subsequently be discarded.
 Whenever the stream of instructions supplied by the instruction fetch unit is
interrupted, the pipeline stalls.

HANDLING DATA HAZARDS (or) DATA DEPENDENCY


 Consider the two instructions:
 Add R2, R3, #100
 Subtract R9, R2, #30
 The destination register R2 for the Add instruction is a source register for the
Subtract instruction.
 There is a data dependency between these two instructions, because register R2
carries data from the first instruction to the second.

 There are two techniques by which we can handle data hazards.
 They are
(1) Using Operand Forwarding (2) Using Software

Handling Data Dependencies Using Operand Forwarding


 Pipeline stalls due to data dependencies can be reduced through the use of operand forwarding.
 Rather than stalling the instruction, the hardware can forward the value from the result register to the ALU input through multiplexers.
 The second instruction can get the data directly from the output of the ALU after the previous instruction is completed.
 A special arrangement needs to be made to “forward” the output of the ALU to the input of the ALU.
Example :
I1 : ADD R1, R2, R3
I2 : SUB R4, R1, R5
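In hardware, the forwarding decision reduces to a comparison of pipeline-register fields: if the destination of the instruction finishing in the EX stage matches a source of the instruction entering the EX stage, a multiplexer selects the ALU output instead of the stale register-file value. A minimal C sketch of that selection logic follows (the names select_operand, ex_dest and alu_result are illustrative, not taken from the notes).

    #include <stdio.h>

    /* Choose an ALU operand: forward the previous instruction's ALU
     * result when its destination matches our source register.      */
    int select_operand(int src_reg, int regfile_value,
                       int ex_dest, int alu_result) {
        if (src_reg == ex_dest)   /* forwarding condition     */
            return alu_result;    /* bypass the register file */
        return regfile_value;     /* no hazard: normal read   */
    }

    int main(void) {
        /* I1: ADD R1,R2,R3 just produced 42 for R1, but the
         * register file still holds the old value 7.         */
        int operand = select_operand(1 /*R1*/, 7, 1 /*EX dest*/, 42);
        printf("SUB reads R1 = %d (forwarded)\n", operand);  /* 42 */
        return 0;
    }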

Handling Data Dependencies Using Software


 An alternative approach is for the compiler to detect data dependencies and deal with them in software.
 When the compiler identifies a data dependency between two successive instructions Ij and Ij+1, it can insert three explicit NOP (No-operation) instructions between them, as sketched below.
 The NOPs introduce the necessary delay to enable instruction Ij+1 to read the new value from the register file after it is written.
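A compiler-side sketch of the same idea (a toy pass written for these notes, not a real compiler algorithm): scan consecutive instructions and, whenever a later instruction reads the destination of its predecessor, emit three NOPs between them.

    #include <stdio.h>

    typedef struct { const char *text; int dest, src1, src2; } Insn;

    int main(void) {
        Insn prog[] = {
            { "Add      R2, R3, #100", 2, 3, -1 },
            { "Subtract R9, R2, #30",  9, 2, -1 },
        };
        int n = sizeof prog / sizeof prog[0];

        for (int i = 0; i < n; i++) {
            /* If this instruction reads the previous one's destination,
             * insert three NOPs so the register file is written first. */
            if (i > 0 && (prog[i].src1 == prog[i-1].dest ||
                          prog[i].src2 == prog[i-1].dest))
                printf("NOP\nNOP\nNOP\n");
            printf("%s\n", prog[i].text);
        }
        return 0;
    }

In practice the compiler tries to fill these delay slots with useful, independent instructions rather than NOPs, so that no cycles are wasted.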

HANDLING INSTRUCTION HAZARDS (or) CONTROL HAZARDS

A variety of approaches have been taken for dealing with instruction/control/branch hazards (conditional branches):
1) Multiple Streams
2) Prefetch Branch Target
3) Loop Buffer
4) Branch Prediction
5) Delayed Branch

1) MULTIPLE STREAMS
o The approach is to replicate the initial portions of the pipeline and allow the
pipeline to fetch both possible instruction streams following a branch, making use of multiple streams.
o There are two problems with this approach:
1. Contention delays for access to the registers and to memory.
2. Additional branch instructions may enter the pipeline before the original
branch decision is resolved.
2) PREFETCH BRANCH TARGET

o When a conditional branch is recognized, the target of the branch is prefetched, in addition to the instruction following the branch.
o This target is then saved until the branch instruction is executed.
o If the branch is taken, the target has already been prefetched.
3) LOOP BUFFER
o A loop buffer is a small, very-high-speed memory maintained by the instruction
fetch stage of the pipeline and containing the ‘n’ most recently fetched
instructions, in sequence.
o If a branch is to be taken, the hardware first checks whether the branch target is within the buffer. If so, the next instruction is fetched from the buffer.
4) BRANCH PREDICTION
o To reduce the branch penalty, the processor needs to anticipate that an
instruction being fetched is a branch instruction and predict its outcome to
determine which instruction should be fetched.
o It is generally of two types:
 Static Branch Prediction
 Dynamic Branch Prediction
o Static Branch Prediction - Assume that the branch will not be taken and to fetch
the next instruction in sequential address order.
o Dynamic Branch Prediction - Uses the recent branch history to see if a branch
was taken the last time this instruction was executed.

o Techniques for Branch Prediction


Various techniques can be used to predict whether a branch will be taken. The most common are the following:
 Predict never taken
 Predict always taken
 Predict by opcode
 Taken/not taken switch
 Branch history table
o Branch Prediction Buffer (or) Branch History Table

 One implementation of that approach is a branch prediction buffer or branch


history table.
 A branch prediction buffer is a small memory indexed by the lower portion of
the address of the branch instruction. The memory contains a bit that says whether
the branch was recently taken or not.
 A branch predictor tells us whether or not a branch is predicted to be taken, but the branch target address must still be calculated.
 A cache called the branch target buffer can be used to hold the branch target addresses.
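A common implementation of the branch history table described above uses a 2-bit saturating counter per entry, indexed by the low-order bits of the branch address. The C sketch below is an illustration written for these notes (the table size of 1024 entries is an arbitrary assumption): it predicts "taken" when the counter is 2 or 3 and nudges the counter toward the actual outcome once the branch resolves.

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 1024          /* entries; a power of two          */
    static uint8_t bht[TABLE_SIZE];  /* 2-bit counters, 0..3, start at 0 */

    /* Predict taken when the counter is in a "taken" state (2 or 3). */
    int predict(uint32_t pc) {
        return bht[(pc >> 2) & (TABLE_SIZE - 1)] >= 2;
    }

    /* Train the entry with the actual outcome (saturating at 0 and 3). */
    void update(uint32_t pc, int taken) {
        uint8_t *c = &bht[(pc >> 2) & (TABLE_SIZE - 1)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }

    int main(void) {
        uint32_t branch_pc = 0x1000;
        int outcomes[] = { 1, 1, 1, 0, 1, 1 };  /* e.g. a loop branch */
        for (int i = 0; i < 6; i++) {
            printf("predict=%d actual=%d\n", predict(branch_pc), outcomes[i]);
            update(branch_pc, outcomes[i]);
        }
        return 0;
    }

Once the counter saturates at 3, a single not-taken outcome costs only one misprediction instead of two; that is the benefit of keeping two bits of history per branch rather than one.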

o Branch Prediction Flowchart


 If the instruction is predicted as taken, fetching begins from the target as soon as the PC is known; it can be as early as the ID stage.
 If the instruction is predicted as not taken, sequential fetching and executing
continue.
 If the prediction turns out to be wrong, the prediction bits are changed.
o Types Of Branch Predictor
1. Correlating Predictor - Combines local behavior and global behavior of a particular branch.
2. Tournament Predictor - Makes multiple predictions for each branch and a
selection mechanism that chooses which predictor to enable for a given branch.
5) DELAYED BRANCH
o In MIPS, branches are delayed.
o This means that the instruction immediately following the branch is always
executed, independent of whether the branch condition is true or false. The
position immediately after the branch is known as the branch delay slot.
o When the condition is false, the execution looks like a normal branch.
o When the condition is true, a delayed branch first executes the instruction
immediately following the branch in sequential instruction order before jumping
to the specified branch target address.
FEATURES AND COMPARISON OF 80286, 80386, 80486, PENTIUM IV.

Features of 80286

 The Intel 80286 was introduced in early 1982. It is an x86 16-bit microprocessor
with 134,000 transistors. It was the first Intel processor that could run all the software
written for its predecessor.

 The 80286’s performance is more than twice that of its predecessors, i.e., Intel 8086
and Intel 8088 per clock cycle.

 The 80286 processors have a 24-bit address bus. Therefore, it is able to address up
to 16 MB of RAM.

 The 80286 CPU was designed to run multitasking applications, digital


communications, real-time process control systems, and multi-user systems.

 The 80286 is the first x86 processor that can operate in protected mode. The
protected mode enables up to 16 MB of physical memory to be addressed by the
on-chip memory management unit (MMU), with a 1 GB logical (virtual) address space.

 80286 is a high-performance 16-bit microprocessor with on-chip memory
management and protection capabilities. This processor has been designed for
multi-user as well as multitasking systems, allowing multiple programs to run
simultaneously without interfering with each other.
 Usually, the 80286 processor is booted in real mode, and thereafter it works in
protected mode by software command. But it is not possible to switch the 80286
from protected mode to real mode. To shift from protected mode to real mode, 80286
microprocessors must be reset.

 The 80286 with an 8 MHz clock provides performance up to 6 times higher than that of the 5 MHz 8086.

 There is no on-chip clock generator circuit in 80286. Therefore, an external 82284


chip is required to generate the external clock.

 The 80286 operates in two different modes such as real mode and protected mode.
The real mode is used for compatibility with existing 8086/8088 software base, and
the protected mode is used for enhanced system level features such as memory
management, multitasking, and protection.

 The 80286 introduced several new instructions, including instructions for
protected-mode memory management and task switching, as well as new instructions
for handling interrupts.
Features of 80386

 The 80386 is a 32-bit microprocessor that can support 8-bit, 16-bit and 32-bit
operands. It has 32-bits registers, 32-bits internal and external data bus, and 32-bit
address bus.

 Due to its 32-bit address bus, the 80386 can address up to 4GB of physical
memory. The physical memory of this processor is organized in terms of segments
of 4 GB size at maximum.

 The 80386 CPU is able to support 16K segments, so the total virtual
memory space is 4 GB × 16K = 64 TB.

 Another feature of the 80386 microprocessor is its 16-byte prefetch queue.

 It is manufactured by Intel using 0.8-micron CHMOS technology.

 It is available with 275k transistors in a 132-Pin PGA package.

 The 80386 microprocessor's memory management unit provides virtual memory, paging and four levels of protection.

 It operates in real, protected and virtual real mode. The protected mode of 80386 is
fully compatible with 80286.

 The 80386 can run 8086 applications under protected mode in its virtual 8086
mode of operation, which allows multiple real-mode programs to run
simultaneously alongside protected-mode programs.

 The 80386 instruction set is upward compatible with all its predecessors. The 80386
introduced several new instructions, including bit-manipulation instructions
(such as bit test and bit scan) and instructions supporting its 32-bit operations.

 The 80386 supported external cache memory (through the 82385 cache controller), which allowed frequently accessed
data to be stored close to the processor, reducing the time it takes to access that data.

 The 80386 ran at clock speeds up to 33 MHz, which was significantly faster than
the 80286's maximum speed of 12.5 MHz.

 The 80386 processor supports the Intel 80387 numeric data processor.

Features of 80486

 It has complete 32-bit architecture which can support 8-bit, 16-bit and 32-bit data
types.
 An 8 KB unified level 1 cache for code and data has been added to the CPU. In advanced
versions of the 80486 processor, the size of the level 1 cache has been increased to 16
KB.

 The 80486 is packaged in a 168-pin grid array package. The 25 MHz, 33 MHz, 50
MHz and 100 MHz (DX-4) versions of the 80486 are available in the market.

 Execution time of instructions is significantly reduced. Load, store and arithmetic
instructions are executed in just one cycle when the data already exists in the cache. The
80486 introduced several new instructions, including BSWAP, XADD and
CMPXCHG, along with cache-management instructions.

 Floating-point unit is integrated with 80486 processor. Hence the delay in


communications between the CPU and FPU has been eliminated and all floating-
point instructions are executed within very few CPU cycles.

 The Intel 80486 supports much faster bus transfers.

 This processor retains the complete complex instruction set of the 80386, and more pipelining has
been introduced to improve speed.

 For fast execution of complex instructions, the 80486 has a five-stage pipeline. Two
out of the five stages are used for decoding the complex instructions and the other
three stages are used for execution.

 Power management and System Management Mode (SMM) became standard features of the 80486 processor.

 The 80486 had an on-chip cache memory of up to 16 kilobytes, which helped to


reduce memory access times and improve performance.

 The 80486 included power management features that allowed it to consume less
power when idle, which was an important consideration for portable computers.

 Clock-doubling and clock-tripling technology has been incorporated in faster
versions of the Intel 80486 CPU. These advanced i486 processors can operate in existing
motherboards with 20-33 MHz bus frequency, while running internally at two or
three times the bus frequency.

Features of Pentium IV

 Pentium 4 is a family of high-performance microprocessors developed on the basis of the NetBurst micro-architecture.

 The Pentium is a superscalar microprocessor, which means it can execute multiple


instructions in parallel.
 The Pentium is a 32-bit microprocessor, like the 80386 and 80486, which means it
can handle data in 32-bit chunks.

 It consists of 42 million transistors.

 Clock speed of Pentium 4 varies from 1.3 GHz to 3.8 GHz.

 It uses hyper-pipelined technology and has a 20-stage pipeline.

 While longer pipelines are less efficient than shorter ones, they allow the CPU core to
reach higher frequencies, and thus increase CPU performance.

 To improve the efficiency of the very deep pipeline, the Pentium 4 processors included new
features: Trace Execution Cache, Enhanced Branch Prediction, and a Quad Data
Rate bus.

 The instruction set of the Pentium 4 processor is compatible with x86 (i386), x86-64,
MMX, SSE, SSE2, and SSE3 instructions. These instructions include 128-bit
SIMD integer arithmetic and 128-bit SIMD double-precision floating-point
operations.

 It has 8 KB L1 data cache and an execution trace cache to store up to 12 K decoded


micro-operations (μ-ops) in the order of program execution.

 Another feature of the Pentium 4 processor is that it supports a faster system bus at 400
MHz to 1066 MHz with 3.2 GB/s of bandwidth.

 The Pentium 4 processor has two arithmetic logic units (ALUs) which are operated at
twice the core processor frequency.

 It is fabricated in 0.18 micron CMOS process.

 It has advanced dynamic execution.

 It has enhanced branch prediction.

 It has a rapid execution engine.

 It has enhanced floating point/multimedia applications.

Comparison of 80286, 80386, 80486 and Pentium processors.

Specifications          80286             80386              80486               Pentium
Year introduced         1982              1985               1989                1992
Technology              NMOS              CMOS               CMOS                BICMOS
Clock Rate (MHz)        10 - 16           16 - 33            25 - 33             60 - 66
Processor               16-bit processor  32-bit processor   32-bit processor    32-bit processor
Number of pins          68                132                168                 273
Number of transistors   1,34,000          2,75,000           1.2 million         3.1 million
Physical memory         16 M              4 G                4 G                 4 G
Virtual memory          1 G               64 T               64 T                64 T
Cache Memory            No on-chip        Had external       On-chip cache       On-chip cache
                        cache memory      cache memory       memory upto 16KB    memory upto 16KB
Internal data bus       16                32                 32                  32
External data bus       16                32                 32                  64
Address bus             24                32                 32                  32
Data type (in bits)     8, 16             8, 16, 32          8, 16, 32           8, 16, 32
Instruction Set         Introduced        Instructions to    Instructions to     Introduced
                        several new       support hardware-  support hardware-   multimedia
                        instructions      based virtual      based floating-     instructions
                                          memory             point arithmetic

CONCEPT OF CORE PROCESSOR

A core, or CPU core, is the "brain" of a CPU. It receives instructions, and performs
calculations, or operations, to satisfy those instructions. A CPU can have multiple cores.
A processor with two cores is called a dual-core processor; with four cores, a quad-core; six
cores, hexa-core; eight cores, octa-core.
Each CPU core can perform operations separately from the others. Multiple cores may also
work together to perform parallel operations on a shared set of data in the CPU's
memory cache.
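As a minimal illustration of several cores cooperating on a shared data set, the C sketch below (written for these notes using POSIX threads; the four-thread split is an arbitrary choice) sums an array by giving each thread, and typically each core, its own slice. Compile with the -pthread flag.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4              /* e.g. one thread per core */

    static int  data[N];            /* shared data set                        */
    static long partial[NTHREADS];  /* one slot per thread, so no lock needed */

    static void *worker(void *arg) {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
        long sum = 0;
        for (long i = lo; i < hi; i++)
            sum += data[i];
        partial[id] = sum;          /* publish this thread's result */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < N; i++) data[i] = 1;

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);

        long total = 0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += partial[t];
        }
        printf("total = %ld\n", total);  /* prints 1000000 */
        return 0;
    }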
What is a multicore processor?

A multicore processor is an integrated circuit that has two or more processor cores attached
for enhanced performance and reduced power consumption. These processors also enable
more efficient simultaneous processing of multiple tasks, such as with parallel
processing and multithreading. A dual core setup is similar to having multiple, separate
processors installed on a computer. However, because the two processors are plugged into
the same socket, the connection between them is faster.
Multicore processors are used to boost processor performance
without exceeding the practical limitations of semiconductor design and fabrication. Using
multiple cores also ensures safe operation in areas such as heat generation.

A multi-core processor's design enables the communication between all available cores, and
they divide and assign all processing duties appropriately. The processed data from each
core is transmitted back to the computer's main board (Motherboard) via a single common
gateway once all of the processing operations have been finished. This method beats a
single-core CPU in terms of total performance.

How do multicore processors work?


The heart of every processor is an execution engine, also known as a core. The core is
designed to process instructions and data according to the direction of software programs in
the computer's memory. Over the years, designers found that every new processor design
had limits. Numerous technologies were developed to accelerate performance, including the
following ones:

 Clock speed. One approach was to make the processor's clock faster.
 Hyper-threading. Another approach involved the handling of multiple instruction
threads, called hyper-threading. With hyper-threading, processor cores are designed to
handle two separate instruction threads at the same time.
 More chips. The next step was to add processor chips -- or dies -- to the processor
package, which is the physical device that plugs into the motherboard. A dual-core
processor includes two separate processor cores. A quad-core processor includes four
separate cores. Today's multicore processors can easily include 12, 24 or even more
processor cores.

What are multicore processors used for?


Multicore processors work on any modern computer hardware platform.

 Virtualization. A virtualization platform, such as VMware, is designed to abstract the


software environment from the underlying hardware. Virtualization is capable of
abstracting physical processor cores into virtual processors or central processing units
(vCPUs) which are then assigned to virtual machines (VMs). Each VM becomes a
virtual server capable of running its own OS and application.

 Databases. A database is a complex software platform that frequently needs to run


many simultaneous tasks such as queries. As a result, databases are highly dependent on
multicore processors to distribute and handle these many task threads.

 Analytics and HPC. Big data analytics, such as machine learning, and high-
performance computing (HPC) both require breaking large, complex tasks into smaller
and more manageable pieces.

 Cloud. Organizations building a cloud will almost certainly adopt multicore processors
to support all the virtualization needed to accommodate the highly scalable and highly
transactional demands of cloud software platforms such as OpenStack.

 Visualization. Graphics applications, such as games and data-rendering engines, have


the same parallelism requirements as other HPC applications.
Multicore advantages
 When compared to single-core processors, a multicore processor has the potential of
doing more tasks.
 Low energy consumption when doing many activities at once.
 Data takes less time to reach its destination since both cores are integrated on a single
chip.
 With the use of a small circuit, the speed can be increased.
 Detecting infections with anti-virus software while playing a game is an example of
multitasking.
 Even while running at a lower clock frequency, it can accomplish numerous tasks at the same time.
 In comparison to a single-core processor, it is capable of processing large amounts of
data.

Multicore disadvantages
 Software dependent.
 Performance boosts are limited.
 Power, heat and clock restrictions.

INTRODUCTION TO POWER PC
PowerPC – Performance Optimization With Enhanced RISC – Performance Computing.

PowerPC is a RISC (Reduced Instruction Set Computer) 64-bit architecture
developed jointly by IBM, Motorola and Apple. PowerPC processors were initially designed for personal
computers, embedded systems, and high-performance computing, and are very powerful,
low-cost microprocessors. The RISC architecture tries to keep the processor as busy as
possible.

Design features of PowerPC are as follows:


1. RISC architecture: PowerPC microprocessors use a RISC architecture, which
simplifies the instruction set and reduces the number of clock cycles required to
execute an instruction.
2. High performance: PowerPC microprocessors are designed to deliver high
performance and are used in applications that require high computing power, such as
gaming consoles, high-end workstations, and supercomputers.
3. Multiple cores: Many PowerPC microprocessors have multiple cores, which allows
them to execute multiple instructions simultaneously and improve overall
performance.
4. Large register file: PowerPC microprocessors have a large register file, which
enables them to store and access data quickly, reducing the need for memory access
and improving performance.
5. 64-bit support: PowerPC microprocessors support 64-bit addressing, which allows
them to access large amounts of memory and process large data sets.
6. Endianness: PowerPC microprocessors can operate in either big-endian or little-
endian mode, allowing them to communicate with different types of devices and
systems.
7. Low power consumption: PowerPC microprocessors are designed to consume less
power, making them suitable for use in portable devices such as laptops, tablets, and
smartphones.
8. Support for multiple operating systems: PowerPC microprocessors are compatible
with multiple operating systems, including Mac OS X, Linux, and AIX.
PowerPC machine Architecture:

The PowerPC machine architecture is organized into several layers, each with its own set of
functions and responsibilities. These layers include:

1. Processor hardware: This layer includes the microprocessor chip and its associated
components, such as the cache, memory management unit, and bus interface.
2. Operating system interface: This layer provides the interface between the processor
hardware and the operating system software. It includes system calls, interrupt
handling routines, and other low-level functions that the operating system needs to
interact with the processor hardware.
3. Operating system kernel: This layer provides the core functions of the operating
system, such as memory management, process scheduling, and device driver support.
4. Application programming interface (API): This layer provides a set of functions
and libraries that developers can use to write applications for the PowerPC
architecture. The API includes standard C and C++ libraries, as well as platform-
specific libraries that provide access to hardware features such as graphics and sound.
5. Applications: This layer includes the user-facing applications that run on top of the
operating system. These can range from simple command-line utilities to complex
graphical applications such as web browsers and video games.
 Memory:
Memory consists of 8-bit bytes. PowerPC programs can be written using a virtual
address space of 2^64 bytes. The address space is divided into fixed-length segments,
which are further divided into pages.
 Registers:
There are 32 general-purpose registers (GPRs), GPR0 to GPR31. Each
register is 64 bits long. The general-purpose registers are used to store and manipulate data
and addresses. Since the PowerPC machine supports floating-point data formats, it also has a
floating-point unit (FPU) for computation.

 Some of the register’s supported by PowerPC architecture are:

Register Operations
Link Register (LR) Contains the address to return to at the end of a function call
Condition Register (CR) Signifies the result of an instruction
Count Register (CTR) Holds the loop count
 Data Formats:
 Integers are stored as 8-, 16-, 32-, or 64-bit binary numbers.
 Characters are represented using 8-bit ASCII codes.
 Floating points are represented using two different floating-point formats,
namely the single-precision format and the double-precision format.
 Instruction Formats:

PowerPC supports seven basic instruction formats. All of these instruction formats
are 32 bits long. PowerPC instruction formats have more variety and
complexity than those of other RISC systems such as SPARC. Bit numbering for
PowerPC is the opposite of most other definitions:
 bit 0 is the most significant bit, and
 bit 31 is the least significant bit
 Instructions are first decoded by the upper 6 bits in a field, called the primary
opcode. The remaining 26 bits contain fields for operand specifiers, immediate
operands, and extended opcodes, and these may be reserved bits or fields.
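Because bit 0 is the most significant bit, the primary opcode in bits 0-5 corresponds to the uppermost six bits of the 32-bit word when it is manipulated in C. A small sketch added for these notes (the sample encoding below is illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* PowerPC bits 0-5 (MSB-first numbering) form the primary opcode,
     * i.e. the uppermost 6 bits of the 32-bit instruction word.       */
    uint32_t primary_opcode(uint32_t insn) {
        return (insn >> 26) & 0x3F;
    }

    int main(void) {
        /* The addi instruction carries primary opcode 14; the low
         * 26 bits here are arbitrary operand-field contents.       */
        uint32_t insn = (14u << 26) | 0x00123456u;
        printf("primary opcode = %u\n", primary_opcode(insn));  /* 14 */
        return 0;
    }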
 Addressing Mode:
Load and store operations use one of the following three addressing modes,
depending upon the operand:

Mode                                      Target address (TA) calculation
Register indirect                         TA = (register)
Register indirect with index              TA = (register 1) + (register 2)
Register indirect with immediate index    TA = (register) + displacement
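A C sketch of these three target-address calculations (the enum names and function signature are illustrative and not part of the notes; register contents are passed in directly):

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { REG_INDIRECT, REG_INDEXED, REG_IMMEDIATE } AddrMode;

    /* Compute a load/store target address from the contents of a base
     * register, an index register, and a displacement.                */
    uint64_t target_address(AddrMode mode, uint64_t base_reg,
                            uint64_t index_reg, int64_t disp) {
        switch (mode) {
        case REG_INDIRECT:  return base_reg;              /* TA = (register)        */
        case REG_INDEXED:   return base_reg + index_reg;  /* TA = (reg 1) + (reg 2) */
        case REG_IMMEDIATE: return base_reg + disp;       /* TA = (register) + disp */
        }
        return 0;
    }

    int main(void) {
        unsigned long long ta = target_address(REG_IMMEDIATE, 0x1000, 0, 0x20);
        printf("TA = 0x%llx\n", ta);  /* 0x1020 */
        return 0;
    }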

Branch instructions use one of the following addressing modes:


Mode              Target address (TA) calculation
Absolute          TA = actual address
Relative          TA = current instruction address + displacement
Link Register     TA = (LR)
Count Register    TA = (CTR)

 Instruction Set:

PowerPC architecture is more complex than other RISC systems. The
PowerPC architecture has approximately 200 machine instructions. It
follows pipelined execution of instructions, which means that while one
instruction is being executed, the next one is being fetched from memory and decoded.

 Input and Output:

The PowerPC architecture follows two different methods for performing I/O operations.
In one approach the virtual address space is used, while in the other approach I/O is
performed using virtual memory management.
Features of Power PC601
The PowerPC 601 is a microprocessor chip that was designed by IBM and Motorola in the
early 1990s. Here are some of its features:

1. Architecture: The PowerPC 601 is based on the RISC (Reduced Instruction Set
Computing) architecture, which allows for efficient and fast processing of
instructions.
2. Clock speed: The PowerPC 601 had a clock speed of 50 MHz, which was
considered fast for its time.
3. Pipeline depth: The PowerPC 601 had a pipeline depth of 5 stages, which allowed it
to process instructions quickly.
4. Cache memory: The PowerPC 601 had a 32 KB level 1 (L1) cache and a 256 KB
level 2 (L2) cache, which helped improve its performance.
5. Bus interface: The PowerPC 601 had a 64-bit bus interface, which allowed for fast
data transfer between the processor and memory.
6. Instruction set: The PowerPC 601 supported the PowerPC instruction set, which
included both 32-bit and 64-bit instructions.
7. Floating-point performance: The PowerPC 601 had a high-performance floating-
point unit, which allowed it to perform complex mathematical calculations quickly.

Overall, the PowerPC 601 was a powerful microprocessor for its time, and it was used in a
variety of applications, including Apple's Power Macintosh computers.

AMD ATHLON PROCESSOR


The AMD Athlon processor powers the next generation in computing platforms, delivering
the ultimate performance for cutting-edge applications and an unprecedented computing
experience.

The AMD Athlon processor family features the industry's first seventh-generation x86
microarchitecture, which is designed to support the growing processor and system
bandwidth requirements of emerging software, graphics, I/O, and memory technologies.

The AMD Athlon processor's nine-issue superpipelined microarchitecture includes


multiple full x86 instruction decoders, a high-performance cache architecture, three
independent integer units, three address calculation units, and the x86 industry's first
superscalar, fully pipelined, out-of-order, three-way floating-point unit.
The floating-point unit is capable of delivering 4 gigaflops (Gflops) of single-precision and
more than 2 Gflops of double-precision floating-point results at 1 GHz, for superior
performance on numerically complex applications.

AMD’s Enhanced 3DNow! technology adds instructions to the popular
3DNow! instruction set. It consists of new integer multimedia instructions and software-
directed data movement instructions for optimizing such applications as digital content
creation and streaming video for the internet, as well as new instructions for digital signal
processing (DSP)/communications applications.

The AMD Athlon processors are implemented in AMD’s advanced 0.18-micron process
technology to achieve maximum performance and scalability.

ARCHITECTURE OF AMD ATHLON PROCESSOR


The principal elements of the Athlon core included:
 Multiple Decoders: Three full x86 instruction decoders translate x86 instructions
into fixed-length MacroOPs for higher instruction throughput and increased
processing power. Instead of executing x86 instructions, which have lengths of 1 to
15 bytes, the Athlon processor executes the fixed-length MacroOPs, while
maintaining the instruction coding efficiencies found in x86 programs.
 Instruction Control Unit: Once MacroOPs are decoded, up to three MacroOPs per
cycle are dispatched to the instruction control unit (ICU). The ICU is a 72-entry
MacroOP reorder buffer (ROB) that manages the execution and retirement of all
MacroOPs, performs register renaming for operands, and controls any exception
conditions and instruction retirement operations. The ICU dispatches the MacroOPs
to the processor’s multiple execution unit schedulers.
 Execution Pipeline: The Athlon contains an 18-entry integer/address generation
MacroOP scheduler and a 36-entry floating-point unit (FPU)/multimedia scheduler.
These schedulers issue MacroOPs to the nine independent execution pipelines – three
for integer calculations, three for address calculations, and three for execution of
MMX, 3DNow!, and x87 floating-point instructions.
 Superscalar FPU: AMD’s previous CPUs were poor floating-point performers
compared with Intel’s. This previous weakness has been more than adequately
addressed in the Athlon, which features an advanced three-issue superscalar engine
based on three pipelined out-of-order execution units (FMUL, FADD, and FSTORE).
The term superscalar refers to a CPU’s ability to execute more than one instruction
per clock cycle, and while such processors have existed for some time now, the
Athlon represents the first application of the technology to an FPU subsystem. The
superscalar performance characteristic of the Athlon’s FPU is partly down to
pipelining – the process of pushing data and instructions into a virtual pipe so that the
various segments of this pipe can process the operations simultaneously. The bottom
line is that the Athlon is capable of delivering as many as four 32-bit, single-
precision floating-point results per clock cycle, resulting in a peak performance of
2.4 Gflops at 600MHz.

 Branch Prediction: The AMD Athlon processor offers sophisticated dynamic


branch prediction logic to minimise or eliminate the delays due to the branch
instructions (jumps, calls, returns) common in x86 software.
 System Bus: The Athlon system bus is the first 200MHz system bus for x86
platforms. Based on Digital’s Alpha EV6 bus protocol, the frontside bus (FSB) is
potentially scaleable to 400MHz and beyond and, unlike the shared bus SMP
(Symmetric Multi-Processing) design of the Pentium III, uses a point-to-point
architecture to deliver superior bandwidth for uniprocessor and multiprocessor x86
platforms.
 Cache Architecture: Athlon’s cache architecture is a significant leap forward from
that of conventional sixth-generation CPUs. The total Level 1 cache is 128KB – four
times that of the Pentium III – and the high-speed 64-bit backside Level 2 cache
controller supports between 512KB and a massive 8MB.
 Enhanced 3DNow!: In response to Intel’s Pentium III Streaming SIMD Extensions,
the 3DNow! implementation in the Athlon has been upgraded, adding 24 new
instructions to the original 21 3DNow! instructions – 19 to improve MMX integer
math calculations and enhance data movement for Internet streaming applications
and 5 DSP extensions for soft modem, soft ADSL, Dolby Digital, and MP3
applications.

FEATURES OF AMD ATHLON PROCESSOR


AMD Athlon is a brand of CPUs (central processing units) produced by Advanced Micro
Devices (AMD). Some features of the AMD Athlon processor:

1. High Clock Speeds: AMD Athlon processors offer high clock speeds, which means
they can execute instructions quickly.
2. Multiple Cores: Many AMD Athlon processors have multiple cores, which allows
for better multitasking and overall performance.
3. Cache Memory: The AMD Athlon processors come with varying levels of cache
memory, which is used to temporarily store frequently accessed data for quick
access.
4. HyperTransport Technology: AMD Athlon processors use HyperTransport
technology, which allows for high-speed communication between the processor and
other components in the computer.
5. 64-bit Architecture: The AMD Athlon processors are designed with 64-bit
architecture, which allows for more memory to be used and better performance in
certain applications.
6. AMD Virtualization Technology: Some AMD Athlon processors feature AMD
Virtualization technology, which allows for better performance in virtualized
environments.
7. Overclocking: Many AMD Athlon processors can be overclocked, which means
running the processor at higher speeds than it was designed for, to achieve even
better performance.

APPLICATIONS OF AMD ATHLON PROCESSOR


All AMD Athlon processors provide industry-leading processing power for cutting-edge
software applications, including digital content creation, digital photo editing, digital video,
image compression, video encoding for streaming over the internet, soft DVD, commercial
3D modeling, workstation-class computer-aided design (CAD), commercial desktop
publishing, and speech recognition.
1. Gaming: AMD Athlon processors are popular among gamers due to their high clock
speeds, multiple cores, and support for modern gaming technologies like DirectX 12
and Vulkan.
2. Productivity: AMD Athlon processors can handle multitasking with ease, making
them a good choice for productivity tasks such as office work, video editing, and
graphic design.
3. Home Entertainment: AMD Athlon processors can power home entertainment
systems, such as media center PCs and home theater systems, providing high-quality
video and audio performance.
4. Server Applications: AMD Athlon processors are used in servers, providing reliable
performance and support for virtualization technologies.
5. Embedded Systems: AMD Athlon processors can be used in embedded systems
such as point-of-sale systems, kiosks, and digital signage, where they can provide a
good balance of performance and power efficiency.

FEATURES AND APPLICATIONS OF SUPER SPARC PROCESSOR.

SPARC (derived from Scalable Processor ARChitecture) is a RISC (Reduced Instruction


Set Computing) ISA (Instruction Set Architecture) developed by Sun Microsystems.
SPARC has been implemented in processors used in a range of computers from laptops to
supercomputers such as enterprise servers. They run operating systems like Solaris,
OpenBSD and NetBSD.

SPARC was one of the most successful early commercial RISC systems, and its success led
to the introduction of similar RISC designs from many vendors through the 1980s and
1990s. The first implementation of the original 32-bit architecture (SPARC V7) was used in
Sun's Sun-4 computer workstation and server systems, replacing their earlier Sun-3 systems
based on the Motorola 68000 series of processors.

SPARC has become a widely used architecture for hardware used with UNIX-based
operating systems, including Sun's own Solaris systems.

Characteristics of SPARC Processor

• Reduces the number of instructions the processor must perform.


• Reduces the number of types of memory addresses the processor needs to handle.
• Provides language compilers that are optimized for a SPARC microprocessor.
• Puts very little processor operation in microcode, since accessing microcode
consumes clock cycles.

Advantages of SPARC Processor

• The SPARC structure is scalable and adaptable, both in terms of cost and capacity.
• SPARC incorporates object-oriented programming (OOP) features.
• It is versatile, with numerous possibilities for commercial, aerospace, military, and
technical applications.
• SPARC is highly scalable and open source.

Disadvantages of SPARC Processor

• It is vulnerable to misuse by individuals because it is an open architecture.


• SPARC cannot be used for educational purposes.
• Only computer architects and developers use it to manage server applications and
lower-level programming.

SuperSPARC PROCESSOR

The SuperSPARC processor is a microprocessor designed and manufactured by Sun


Microsystems. It was released in 1992 and was the successor to the SPARC processor. The
SuperSPARC was designed to improve performance over the original SPARC by increasing
clock speed and adding new instructions.

Features of SuperSPARC PROCESSOR

1. Architecture: The SuperSPARC processor is a 32-bit superscalar processor, which


means it can execute multiple instructions in parallel. It has a five-stage pipeline and
can execute up to four instructions per cycle.
2. Clock speed: The SuperSPARC processor was initially introduced in 1992 and had a
clock speed of 33 MHz. Later versions of the processor had clock speeds of up to 50
MHz.
3. Performance: The SuperSPARC processor was a significant improvement over its
predecessors in terms of performance. It could deliver up to four times the
performance of the SPARCstation 2, which was Sun's most popular workstation at
the time. It was often used in high-performance computing applications, such as
scientific computing and computer-aided design.
4. Cache: The SuperSPARC processor had a large on-chip cache, which helped to
improve performance by reducing the number of times the processor needed to
access main memory. The SuperSPARC processor had a primary cache of 16 KB and
a secondary cache of up to 4 MB.
5. Instruction set: The SuperSPARC processor was based on the SPARC V8
instruction set architecture, which was an enhancement over the previous SPARC V7
architecture. It added support for hardware multiplication and division, as well as
improved floating-point performance.
6. Memory management: The SuperSPARC processor had a built-in memory
management unit (MMU) that provided virtual memory support and protected
memory.
7. End of life: The SuperSPARC processor was eventually phased out by Sun
Microsystems in favor of newer processors such as the UltraSPARC and the
SPARC64. However, it remains a significant milestone in the development of the
SPARC architecture.

Architecture of SuperSPARC PROCESSOR

The chip, with an optional second-level cache controller, is targeted at a wide range of
systems from uniprocessor desktop machines to multiprocessing file and compute servers.
Applications of SuperSPARC processor are as follows:
1. Workstations: SuperSPARC was primarily used in Sun's high-end workstations,
such as the SPARCstation 20 and SPARCstation 10. These workstations were
popular in engineering and scientific fields where high-performance computing was
necessary.
2. Servers: SuperSPARC was also used in Sun's servers, such as the Sun Enterprise
3000 and Sun Enterprise 4000. These servers were used in businesses and
organizations that required high levels of computing power and reliability.
3. Database management: SuperSPARC was particularly well-suited for database
management applications. Its high performance and ability to handle large amounts
of data made it a popular choice for businesses that relied heavily on databases.
4. Scientific computing: SuperSPARC was also used in scientific computing
applications, such as weather forecasting and climate modeling. Its ability to handle
complex calculations made it well-suited for these types of applications.
5. Graphics and visualization: SuperSPARC was used in graphics and visualization
applications, such as computer-aided design (CAD) and 3D modeling. Its high
performance and ability to handle large data sets made it ideal for these applications

Overall, SuperSPARC was a versatile microprocessor chip that was used in a variety of
applications that required high-performance computing, reliability, and scalability.
