
POPULAR PUBLICATIONS

FLYNN'S TAXONOMY OF COMPUTER ARCHITECTURE

Chapter at a Glance
Flynn's classification
The most popular taxonomy of computer architecture was defined by Flynn in 1966. Flynn's classification scheme is based on the notion of a stream of information. Two types of information flow into a processor: instructions and data. The instruction stream is defined as the sequence of instructions performed by the processing unit. The data stream is defined as the data traffic exchanged between the memory and the processing unit. According to Flynn's classification, either of the instruction or data streams can be single or multiple. Computer architecture can accordingly be classified into the following four distinct categories:
Single-instruction single-data streams (SISD)
Single-instruction multiple-data streams (SIMD)
Multiple-instruction single-data streams (MISD)
Multiple-instruction multiple-data streams (MIMD)

Different Types of SIMD Models


A synchronous array of parallel processors is called an array processor, which consists of multiple
processing elements (PEs) under the supervision of one control unit (CU). An array processor can
handle single instruction and multiple data (SIMD) streams. In this sense, array processors are also
known as SIMD computers. SIMD machines are especially designed to perform vector
computations over matrices or arrays of data. There are two different types of SIMD models:
Distributed memory model
Shared memory model

Shared Memory Models


In the shared-memory SIMD model, instructions are stored by the host in the control memory and data are stored directly in the shared memory through the data bus. An array control unit is attached to the control memory. This array control unit separates scalar instructions from vector instructions. Scalar instructions are transferred to the scalar processor for execution, while vector instructions are transferred to the different PEs over the broadcast bus.

CA-104
COMPUTER ARCHITECTURE

Multiple Choice Type Questions.


1. Advantage of MMX technology lies in [WBUT 2010]
a) Multimedia application b) VGA c) CGA d) none of these
Answer: (a)

2. Array Processor is present in [WBUT 2010]


a) SIMD b) MISD c) MIMD d) none of these
Answer: (a)

3. Which one of the following has no practical usage? [WBUT 2010, 2014, 2016]
a) SISD b) SIMD c) MISD d) MIMD
Answer: (c)

4. The expression for Amdahl's law is [WBUT 2011, 2016, 2017]
a) S(n) = n/f where n → ∞ b) S(n) = f where n → 0
c) S(n) = 1/f where n → ∞ d) None of these
Answer: (c)
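For a quick check, the law behind option (c) can be evaluated directly. This is a minimal sketch, assuming the common convention that f denotes the serial (non-parallelizable) fraction of the work, so S(n) = 1/(f + (1-f)/n), which approaches 1/f as n grows:

```python
def amdahl_speedup(f, n):
    """Speedup on n processors when a fraction f of the work is serial
    (Amdahl's law): S(n) = 1 / (f + (1 - f) / n)."""
    return 1.0 / (f + (1.0 - f) / n)

# With 10% serial work the speedup saturates at 1/0.1 = 10,
# no matter how many processors are added.
print(amdahl_speedup(0.1, 10))         # ~5.26
print(amdahl_speedup(0.1, 1_000_000))  # ~10.0
```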
5. Which MIMD systems are best according to scalability with respect to the
number of processors? WBUT 2011]
a) Distributed memory computers b) ccNUMA systems
c) nccNUMA systems d) Symmetric multiprocessors
Answer: (a)
6. Superscalar processors have CPI of [WBUT 2011, 2017]
a) less than 1 b) greater than 1 c) more than 2 d) greater than 3
Answer: (a)
7. The main memory of a computer has 2cm blocks while the cache has 2c blocks. If the cache uses the set associative mapping scheme with 2 blocks per set, then block k of the main memory maps to the set [WBUT 2011, 2016]
a) (k mod m) of the cache b) (k mod c) of the cache
c) (k mod 2c) of the cache d) (k mod 2m) of the cache
Answer: (b), since the cache has 2c/2 = c sets
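The set index in question 7 follows from the set count: with 2c cache blocks and 2 blocks per set there are c sets, so block k maps to set (k mod c). A minimal sketch (the concrete value c = 8 is only an illustration):

```python
def set_index(block, cache_blocks, blocks_per_set):
    """Set-associative mapping: memory block k maps to set k mod S,
    where S = cache_blocks / blocks_per_set is the number of sets."""
    num_sets = cache_blocks // blocks_per_set
    return block % num_sets

c = 8                           # illustrative cache size: 2c = 16 blocks
print(set_index(19, 2 * c, 2))  # 19 mod 8 = 3
```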
8. As the bus in a multiprocessor is a shared resource, there must be some mechanism to resolve conflicts. Which of the following algorithms is not a conflict resolution technique? [WBUT 2016]
a) state priority algorithm b) FIFO algorithm
c) LRU algorithm d) Daisy Chaining algorithm
Answer: (a)
9. The MMX technology uses [WBUT 2018]
a) Pipelining technique b) Vectorizing technique
c) SIMD technique d) MIMD technique
Answer: (c)
Short Answer Type Questions
1. Discuss Flynn's classification of parallel computers. [WBUT 2006, 2007, 2009, 2010, 2017]
OR,
Describe Flynn's classification of computer architecture. [WBUT 2012, 2019]
OR,
Explain in brief with neat diagrams the Flynn's classifications of computers. [WBUT 2013, 2018]
OR,
Explain Flynn's classification. [WBUT 2016]
Answer:
The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) and data streams available in the architecture:
Single Instruction, Single Data stream (SISD)
A sequential computer which exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines like a PC or old mainframes.

Fig: SISD Architecture (diagram: Instruction Memory, Control Unit, Processing Unit and Data Memory linked by a single instruction stream and a single data stream)
Single Instruction, Multiple Data streams (SIMD)
A computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized, for example an array processor or GPU.
Fig: SIMD Architecture (diagram: one Instruction Memory and Control Unit broadcasting a single instruction stream to several Processing Units, each with its own Data Memory and data stream)

Multiple Instructions, Single Data stream (MISD)
Multiple instructions operate on a single data stream. This is an uncommon architecture which is generally used for fault tolerance: heterogeneous systems operate on the same data stream and must agree on the result. Examples include the Space Shuttle flight control computer.
Fig: MISD Architecture (diagram: several Instruction Memory / Control Unit pairs feeding independent instruction streams into one Processing Unit working on a single data stream from one Data Memory)

Multiple Instructions, Multiple Data streams (MIMD)
Multiple autonomous processors simultaneously executing different instructions on different data. Distributed systems are generally recognized to be MIMD architectures, either exploiting a single shared memory space or a distributed memory space.
Fig: MIMD Architecture (diagram: several independent Instruction Memory, Control Unit, Processing Unit and Data Memory pipelines, each with its own instruction and data streams)

2. Implement the data routing logic of SIMD architecture to compute S(k) = A_0 + A_1 + ... + A_k for k = 0, 1, 2, ..., N-1. [WBUT 2008]
OR,
Why do we need masking mechanism in SIMD array processors? [WBUT 2015]
In an SIMD array processor of 8 PEs, the sum S(k) of the first k components of a vector A is desired for each k from 0 to 7.
Let A = (A_0, A_1, ..., A_7). We need to compute the following:
S(k) = A_0 + A_1 + ... + A_k for k = 0, 1, ..., 7
Discuss how data-routing and masking are performed in the processor. [WBUT 2015]
Answer:
The masking mechanism of an SIMD processor makes it possible to mask individual machine operations within a single instruction that incorporates several operations. To accomplish this, each machine operation within the instruction includes a number of masking bits which address a specific location in a mask register. The mask register includes a mask bit bank. The mask location selected within the mask register is bit-wise ANDed with a mask context bit in order to establish whether the processing element will be enabled or disabled for a particular conditional subroutine that is called.

We show the execution details of the following vector instruction in an array of N processing elements (PEs) to illustrate the necessity of data routing in an array processor. Here the sum S(k) of the first k components in a vector A is desired for each k from 0 to n-1.
Now, A = (A_0, A_1, ..., A_{n-1})
So, the following n summations are:
S(k) = A_0 + A_1 + ... + A_k for k = 0, 1, 2, ..., n-1
These n vector summations can be computed recursively by going through the following n-1 iterations:
S(0) = A_0
S(k) = S(k-1) + A_k for k = 1, 2, 3, ..., n-1
The above recursive summations for the case of n = 8 are implemented in an array processor with N = 8 PEs in log2(n) = 3 steps, as shown in the figure below. First, each A_i is transferred to the R_i register in PE_i for i = 0, 1, ..., n-1. In step 1, A_i is routed from R_i to R_{i+1} and added to A_{i+1}, leaving the sum A_i + A_{i+1} in R_{i+1}, for i = 0, 1, ..., 6. In step 2, the intermediate sums in R_i are routed to R_{i+2} for i = 0 to 5. In step 3, the intermediate sums in R_i are routed to R_{i+4} for i = 0 to 3. After these three route-and-add steps, PE_k has the final value of S(k) for k = 0, 1, 2, ..., 7, as shown in the last column of the figure below.


Fig: The calculation of the summations S(k) = A_0 + A_1 + ... + A_k for k = 0, 1, 2, ..., 7 in an SIMD machine (diagram: PE_0 to PE_7 each start with A_i in their R registers and, over steps 1 to 3, accumulate partial sums until PE_k holds S(k))
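The route-and-add steps described above can be simulated in software. This is a sketch of the recursive-doubling idea only, not of any particular machine; masking is modelled by leaving PEs with index i < 2^j unchanged in step j:

```python
import math

def simd_prefix_sums(a):
    """Recursive doubling: in step j each PE i >= 2**j adds the value
    routed from PE i - 2**j, so after log2(n) steps R[k] holds S(k)."""
    n = len(a)
    r = list(a)  # the R registers, one per PE
    for j in range(int(math.log2(n))):
        d = 2 ** j
        # masked PEs (i < d) keep their value; the rest route and add
        r = [r[i] + r[i - d] if i >= d else r[i] for i in range(n)]
    return r

print(simd_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36] -- PE k holds S(k)
```

For n = 8 the loop runs exactly the three steps of the figure, with routing distances 1, 2 and 4.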

3. A 50 MHz processor was used to execute a program with the following instruction mix and clock cycle counts: [WBUT 2011]
Instruction Type           Instruction Count   Clock Cycle Count
Integer Arithmetic         50000               2
Data Transfer              70000               3
Floating point arithmetic  25000               1
Branch                     4000                2
Calculate the effective CPI and MIPS rate for this program.
Answer:
We know,
CPU time = Instruction Count (IC) × Clock cycles Per Instruction (CPI) × Clock Cycle Time (CCT)
         = Σ (CPI_i × I_i) × CCT
where I_i = number of times the i-th instruction type is executed in the program, and CPI_i = number of clock cycles for the i-th instruction type.
The average Clock cycles Per Instruction (CPI) is given by
CPI = (1/IC) × Σ (CPI_i × I_i) = Σ (f_i × CPI_i)
where f_i = I_i / IC is the frequency of occurrence of the i-th instruction type and IC = Σ I_i is the total instruction count.
CPU time = IC × CPI × CCT
MIPS = IC / (CPU time × 10^6) = Clock rate / (CPI × 10^6)
Given, Clock rate = 50 MHz and IC = 50000 + 70000 + 25000 + 4000 = 149000.
Total cycles = 50000×2 + 70000×3 + 25000×1 + 4000×2 = 343000
Effective CPI = 343000 / 149000 ≈ 2.30
MIPS = (50 × 10^6) / (2.30 × 10^6) ≈ 21.7
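The arithmetic for this kind of instruction-mix question is easy to mechanize; the sketch below simply restates the question's table:

```python
# Instruction mix from the question: type -> (count, cycles per instruction)
mix = {
    "integer arithmetic":        (50_000, 2),
    "data transfer":             (70_000, 3),
    "floating point arithmetic": (25_000, 1),
    "branch":                    (4_000, 2),
}
clock_rate = 50e6  # 50 MHz

ic = sum(count for count, _ in mix.values())              # total instructions
cycles = sum(count * cpi for count, cpi in mix.values())  # total clock cycles
effective_cpi = cycles / ic
mips = clock_rate / (effective_cpi * 1e6)

print(round(effective_cpi, 2), round(mips, 2))  # 2.3 21.72
```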
4. What is instruction level parallelism? [WBUT 2014]
OR,
Discuss the techniques to achieve instruction level parallelism. [WBUT 2018]
Answer:
Instruction-level parallelism (ILP) is a measure of how many of the operations in a
computer program can be performed simultaneously. The potential overlap among
instructions is called instruction level parallelism. A goal of compiler and processor
designers is to identify and take advantage of as much ILP as possible. Ordinary
programs are typically written under a sequential execution model where instructions
execute one after the other and in the order specified by the programmer. ILP allows the
compiler and the processor to overlap the execution of multiple instructions or even to
change the order in which instructions are executed.
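As a toy illustration (not from the original text), a scheduler may issue two instructions together only when no register dependence links them. The three-tuple encoding below is a made-up convention for the sketch:

```python
def independent(i1, i2):
    """True when two (dest, src1, src2) instructions have no RAW, WAR
    or WAW dependence on registers, i.e. they may issue in the same cycle."""
    d1, *srcs1 = i1
    d2, *srcs2 = i2
    return d1 != d2 and d1 not in srcs2 and d2 not in srcs1

add = ("r1", "r2", "r3")  # r1 = r2 + r3
mul = ("r4", "r5", "r6")  # r4 = r5 * r6 -> shares no registers with add
sub = ("r7", "r1", "r2")  # r7 = r1 - r2 -> reads r1 written by add (RAW)

print(independent(add, mul))  # True: can execute in parallel
print(independent(add, sub))  # False: sub must wait for add
```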

Long Answer Type Questions


1. Differentiate between multiprocessors and multicomputers based on their structures, resource sharing and inter-processor communication. [WBUT 2007]
Answer:
A multicomputer comprises a number of von Neumann computers, or nodes, linked by an
interconnection network. Each computer executes its own program. This program may
access local memory and may send and receive messages over the network. Messages are
used to communicate with other computers or, equivalently, to read and write remote
memories. In the idealized network, the cost of sending a message between two nodes is independent of both node location and other network traffic, but does depend on message length.


A defining attribute of the multicomputer model is that accesses to local (same-node) memory are less expensive than accesses to remote (different-node) memory. That is, read and write are less costly than send and receive. Hence, it is desirable that accesses to local data be more frequent than accesses to remote data. This property, called locality, is a fundamental requirement for parallel software, in addition to concurrency and scalability.
Another important class of parallel computer is the multiprocessor, or shared-memory MIMD computer. In multiprocessors, all processors share access to a common memory, typically via a bus or a hierarchy of buses. In the idealized Parallel Random Access Machine (PRAM) model, often used in theoretical studies of parallel algorithms, any processor can access any memory element in the same amount of time. In practice, scaling this architecture usually introduces some form of memory hierarchy; in particular, the frequency with which the shared memory is accessed may be reduced by storing copies of frequently used data items in a cache associated with each processor. Access to this cache is much faster than access to the shared memory.
2. a) Describe the distributed and shared memory models of SIMD architecture.
b) Draw the block diagram and explain the functionality of a processing element. [WBUT 2008]
Answer:
a) There are two types of SIMD computer models, described below, based on the memory distribution and addressing scheme used: the Distributed-Memory Model and the Shared-Memory Model. Most SIMD computers use a single control unit and distributed memories, except for a few that use associative memories. The instruction set of an SIMD computer is decoded by the array control unit. The processing elements (PEs) in the SIMD array are passive ALUs executing instructions broadcast from the control unit.
Distributed-Memory Model: A distributed-memory SIMD computer consists of an array of PEs which are controlled by the same array control unit, as shown in Fig: 1.

Fig: 1 Distributed-Memory Model SIMD architecture (diagram: a host computer loads the control memory; the array control unit passes scalar instructions to a scalar processor and broadcasts vector instructions and constants over a broadcast bus to PE_1 ... PE_n, each with its own local memory LM_i, all joined by a data-routing network)


Program and data are loaded into the control memory through the host computer.
An instruction is sent to the control unit for decoding. If it is a sçalar or program control
operation, it will be directly executed by a scalar processor attached to the control unit. If
the decoded instruction is a vector operation, it will be broadcast to all the PEs for
parallel execution.
Partitioned data sets are distributed to all the local memories attached to the PEs through
a vector data bus. The PEs are interconnected by a data-routing network which performs
inter-PE data communications such as shifting, permutation, and other routing operations.
The data-routing network is under program control through the control unit. The PEs are
synchronized in hardware by the control unit. Almost all SIMD machines built today are
based on the distributed-memory model. Illiac IV and CM-2 are examples of Distributed-Memory SIMD architecture.

Shared-Memory Model: In Fig: 2 we show a variation of the SIMD computer using shared memory among the PEs. An alignment network is used as the inter-PE memory communication network. Again this network is controlled by the control unit. The alignment network must be properly set to avoid access conflicts. Some SIMD computers of the Shared-Memory Model use bit-slice PEs, for example the DAP610 and the CM-200.


Fig: 2 Shared-Memory Model SIMD architecture (diagram: the host and control memory feed the array control unit, which broadcasts vector instructions to PE_1 ... PE_n; the PEs reach shared memory modules SM_1 ... SM_m through an alignment network)


b) An array processor is a synchronous parallel computer with multiple arithmetic logic units, called processing elements (PEs). The PEs are synchronized to perform the same function at the same time.
Fig: 3 Components of a Processing Element (PE) (diagram: each PE_i, i = 0, 1, ..., N-1, contains an ALU with registers A_i, B_i, C_i, routing register R_i, address register D_i and index register I_i, a local memory PEM_i, a link to the CU, and links to other PEs via the interconnection network)


PEs can establish an appropriate data routing mechanism. Each PE consists of an ALU with registers and local memory. The PEs are interconnected by a data-routing network. A set of local registers and flags A_i, B_i, C_i and S_i is present in each PE. The data routing register is R_i, the address register is D_i and the local index register is I_i. When a data transfer occurs between PEs, the contents of the data routing register are transferred.
3. What are the main differences and similarities between multicomputer and multiprocessor? Give the architecture for a typical MIMD processor. Explain the shared memory modes of MIMD. [WBUT 2011]
OR,
Briefly discuss MIMD architecture. [WBUT 2012, 2014]
OR,
What are the differences and similarities between multicomputer and multiprocessor? [WBUT 2014]
OR,
What are the differences between shared memory multiprocessor system and message passing multicomputer system? [WBUT 2018]
Answer:
One parallel machine model is called the multicomputer system. A multicomputer
comprises a number of von Neumann computers, or nodes, linked by an interconnection
network. Each computer executes its own program. This program may access local
memory and may send and receive messages over the network. Messages are used to
communicate with other computers or, equivalently, to read and write remote memories.
A shared-memory MIMD computer is called the multiprocessor computer. In multiprocessors, all processors share access to a common memory, typically via a bus or a hierarchy of buses, and any processor can access any memory element in the same
amount of time. Examples of this class of machine include the Silicon Graphics
Challenge, Sequent Symmetry, and the many multiprocessor workstations.
MIMD (multiple instruction, multiple data) is a technique to achieve parallelism.
Machines using MIMD have a number of processors that function asynchronously and
independently. At any time, different processors may be executing different instructions
on different pieces of data. MIMD machines can be of either shared memory or
distributed memory categories. These classifications are based on how MIMD processors access memory. Shared memory machines may be bus-based. Distributed memory
machines may have hypercube or mesh interconnection schemes. MIMD machines with
shared memory have processors which share a common, central memory. In the simplest
form, all processors are attached to a bus which connects them to memory. MIMD
machines with hierarchical shared memory use a hierarchy of buses to give processors
access to each other's memory. Processors on different boards may communicate through
inter-nodal buses. Buses support communication between boards. With this type of
architecture, the machine may support over a thousand processors.
4. Why do we need parallel processing? What are different levels of parallel
processing? Explain. [WBUT 2015]
Answer:
In computers, parallel processing is the processing of program instructions by dividing
them among multiple processors with the objective of running a program in less time. In

the earliest computers, only one program ran at a time. A computation-intensive program that took one hour to run and a tape-copying program that took one hour to run would
take a total of two hours to run. An early form of parallel processing allowed the
interleaved execution of both programs together. The computer would start an
operation, and while it was waiting for the operation to complete, it would execute the
processor-intensive program. The total execution time for the two jobs would be a little
over one hour.

Levels of parallel processing:
We can have parallel processing at four levels.
i) Instruction Level: Most processors have several execution units and can execute several instructions (usually machine level) at the same time. Good compilers can reorder instructions to maximize instruction throughput. Often the processor itself can do this. Modern processors even parallelize execution of micro-steps of instructions within the same pipe.
ii) Loop Level: Here, consecutive loop iterations are candidates for parallel execution. However, data dependences between subsequent iterations may restrict parallel execution of instructions at loop level. There is a lot of scope for parallel execution at loop level.
iii) Procedure Level: Here parallelism is available in the form of parallel executable procedures. Here the design of the algorithm plays a major role. For example, each thread in Java can be spawned to run a function or method.

iv) Program Level: This is usually the responsibility of the operating system, which runs
processes concurrently. Different programs are obviously independent of each other.
So parallelism can be extracted by the operating system at this level.
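The loop-level case can be illustrated with a minimal sketch: because no value flows from one iteration to the next, the iterations may run concurrently. Threads are used here only to show the structure; CPU-bound Python code would typically use processes instead:

```python
from concurrent.futures import ThreadPoolExecutor

def body(i):
    # no data flows between iterations, so they are independent
    return i * i

# the eight iterations of "for i in range(8): body(i)" run concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(body, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Had each iteration read a value written by the previous one (for example a running total), this transformation would no longer be safe without extra synchronization.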
5. Write short notes on the following:
a) Array processor [WBUT 2005, 2007, 2010]
b) MMX Technology [WBUT 2005, 2006, 2007]
c)CM-2 machine [WBUT 2008]
d) Flynn's classification [WBUT 2011]
Answer:
a) Array processor:
The SIMD-1 Array Processor consists of a Memory, an Array Control Unit (ACU) and a one-dimensional SIMD array of simple processing elements (PEs). The figures show a 4-processor array and the initial image seen when the model is loaded.

Fig: The SIMD-1 Array Processor model (diagram: a Memory holding the program, an Array Control Unit with PC, CC, AC-IR, PE-IR and PEC registers, and a SIMD array of PEs driven by a two-phase clock)
The system operates on a two-phase clock. Each unit executes its internal actions in the first phase of the clock and sends out a result packet in the second phase. The Memory, for example, reads an instruction or operand in the first phase and sends its output to the ACU in the second phase.
The ACU is a simple load/store, register-register arithmetic processor. It has 16 general purpose registers, a Program Counter (PC), a Condition Code register (CC) and an Instruction Register (AC-IR). The Program Counter has two fields: label and offset. The label field is initially set to "main" and the offset to zero. The ACU also uses two other registers, the Processing Element Instruction Register (PE-IR) and the Processing Element Control register (PEC), which are global registers used to communicate with the SIMD Array. The Processing Elements operate in lock step, i.e. each active PE (determined by the state of its PEC bit) obeys the same instruction at the same time. Whenever a PE ACC is updated by a PE instruction, the PE sends the new ACC value to each of its neighbours.
When first loaded, the model contains a program which reverses the order of the values held in memory locations 0 and 2 of the Processing Elements (initially in locations 0 and 2 of each of their memories) and leaves the results in locations 1 and 3 of each of their memories.

b) MMX Technology:
MMX technology is an extension to the Intel Architecture (IA) designed to improve the performance of multimedia and communication algorithms. The Pentium processor with MMX Technology was the first microprocessor to implement the new instruction set.
Operation of MMX Technology over the non-MMX Pentium microprocessors:
The MMX technology consists of several improvements:
1. There are 57 new instructions that have been designed to handle video, audio, and graphical data more efficiently. Programs can use MMX instructions without changing to a new mode or operating-system visible state.

2. A new 64-bit integer data type is also added with MMX technology.
3. A new process, Single Instruction Multiple Data (SIMD), makes it possible for one instruction to perform the same operation on multiple data items.
4. The memory cache on the microprocessor has increased to 32 KB, meaning fewer accesses to memory that is off the microprocessor.
All MMX chips have a larger internal L1 cache than their non-MMX counterparts. This improves the performance of any software running on the chip, regardless of whether it actually uses the MMX-specific instructions or not.

Part of the Pentium processor with MMX implementation was the design of a new, dedicated, high-performance MMX pipeline, which was able to execute two MMX instructions with minimal logic changes in the existing units. In addition, the design goal was to stay on the microprocessors' performance curve. With the addition of the new instructions, the instruction decode logic had to be modified to decode, schedule and issue the new instructions at a rate of up to two instructions per clock.

Frequency Speedup
To simplify the design and to meet the core frequency goal, the pipeline of the Pentium processor with MMX was extended with a new pipeline stage (a longer decode). The aim was to maintain and improve the CPI (Clocks per Instruction) of MMX technology while making modifications that increase the Clock Rate.
As we know,
Execution Time = (No. of instructions) × (CPI) × (Clock Cycle Time)
i.e., increasing the Clock Rate decreases the Clock Cycle Time, which in turn decreases Execution Time. So, in order to increase the Clock Rate, the MMX Pentium designers needed to find and eliminate some bottlenecks. The two major bottlenecks were the instruction decoder and the data cache access, so they tried to fix the decoder bottleneck first. An instruction originally used the old 5-stage pipe: Fetch, Decode1, Decode2, Execute, Write-Back.
To speed things up, a 6th stage, Prefetch, was added to the pipe. A queue was also added between Fetch and Decode1 to decouple freezes. So now an instruction looks like: Prefetch, Fetch, Decode1, Decode2, Execute, Write-Back, as shown in the figure below. After adding this new stage, machine timing was rebalanced to take advantage of the extra clock cycle.
Fig: Block diagram of the Pentium Processor with MMX technology (diagram: Prefetch, Fetch, D1, D2, Execute and Writeback stages, with a 16K code cache, instruction FIFO, integer and FP execution units, a 16K data cache and a bus unit)


Although adding a pipeline stage improves frequency, it decreases CPI performance: the longer the pipeline, the more work is done speculatively by the machine, and therefore the more work is thrown away in the case of a branch misprediction. The additional pipeline stage decreased the CPI performance of the processor by 5-6%.
c) CM-2 machine:
The CM-2 was a SIMD architecture based machine. The PEs in the CM-2 were capable of performing bit-serial arithmetic. The control processor, or sequencer, could decompose an 8-bit operation, for example, into 8 PE nano-instructions. The CM-2 provides the mechanism for the programmer to assign PEs to groups that will execute at different times. This functionality is achieved through the use of PE instruction masking.
Although the PEs and the PE module floating point accelerator provide extensive processing capability, the programming paradigm was still limited to SIMD. The CM-2 distinguishes itself from its predecessors through the systematic inclusion of error-detecting and error-correcting circuits within the memories and communication networks. The CM-2 is capable of achieving a peak processing speed of around 10 GFlops.
The CM-2 machine provides hypercube connections between different processing elements (PEs). The PEs were organized into modules, each having 32 PEs. Within a given module, the PEs were organized into two 16-PE sets, with each set having its own router node. All the PEs within a given set use shared memory to communicate with one another by writing values into their respective local memories. Each router node represented a vertex in the hypercube. One interesting feature of the routers was that they provided special circuitry for combining messages with the same destination. In addition to the communication via local memories of PEs within a given module, the CM-2 also supports patterned communications directly across the wires of the hypercube.

Fig: Block diagram of CM-2 machine (diagram: a sequencer driven by the front-end computer broadcasts instructions to the processors and their memories over global result, scalar memory and instruction broadcast buses; a router/NEWS/scanning network, I/O controllers and a frame buffer connect to the outside)

d) Flynn's classification: Refer to Question No. 1 of Short Answer Type Questions.

RISC & CISC ARCHITECTURES
Chapter at a Glance
Non von Neumann architecture characteristics
Any computer architecture in which the underlying model of computation is different from what
has come to be called the standard von Neumann model. A non von Neumann machine may thus
be without the concept of sequential flow of control (i.e. without any register corresponding to a
"program counter" that indicates the current point that has been reached in execution of a
program) and/or without the concept of a variable (i.e. without "named" storage locations in which
a value may be stored and subsequently referenced or changed). Examples of non von Neumann
machines are the dataflow machines and the reduction machines. In both of these cases there is a
high degree of parallelism, and instead of variables there are immutable bindings between names
and constant values.

Cluster Computer
A cluster computer consists of a set of loosely connected computers that work together so that in
many respects they can be viewed as a single system. The components of a cluster are usually
connected to each other through fast local area networks, each node (computer used as a server)
running its own instance of an operating system. Computer clusters emerged as a result of
convergence of a number of computing trends including the availability of low cost
microprocessors, high speed networks, and software for high performance distributed computing.
Clusters are usually deployed to improve performance and availability over that of a single
computer, while typically being much more cost-effective than single computers of comparable
speed or availability. Computer clusters have a wide range of applicability and deployment,
ranging from small business clusters with a handful of nodes to some of the fastest
supercomputers in the world.


Multiple Choice Type Questions


1. Overlapped register windows are used to speed-up procedure call and return in [WBUT 2007, 2011]
a) RISC architectures b) CISC architectures
c) both (a) and (b) d) none of these
Answer: (a)

2. What is a main advantage of classical vector systems (VS) compared with RISC based systems (RS)? [WBUT 2008, 2009]
a) VS have significantly higher memory bandwidth than RS
b) VS have higher clock rate than RS
c) VS are more parallel than RS
d) None of these
Answer: (a)
3. Difference between RISC and CISC is [WBUT 2010]
a) RISC is more complex b) CISC is more effective
c) RISC is better optimizable d) none of these
Answer: (c)
4. The advantage of RISC over CISC is that [WBUT 2011]
a) RISC can achieve pipeline segments requiring just one clock cycle
b) CISC uses many segments in its pipeline, with the longest segment requiring two or more clock cycles
c) both (a) & (b)
d) none of these
Answer: (c)
5. Which of the following is not a RISC architecture characteristic? [WBUT 2012, 2018]
a) simplified and unified format of code of instructions
b) no specialized registers
c) no storage-to-storage instructions
d) small register file
Answer: (d)
6. Which of the following architectures corresponds to von-Neumann architecture? [WBUT 2012]
a) MISD b) MIMD c) SISD d) SIMD
Answer: (c)
7. The CPI value for RISC processors is [WBUT 2015]
a) 1 b) 2 c) 3 d) more
Answer: (a)
8. In which of the following shared memory multiprocessor models is the time to access shared memory the same? [WBUT 2019]
a) NORMA b) COMA c) UMA d) NUMA
Answer: (c)
Short Answer Type Questions
1. Compare between RISC and CISC. [WBUT 2010, 2012, 2014, 2015, 2018]
OR,
Compare RISC and CISC architecture in brief. [WBUT 2019]
Answer:
Characteristics            CISC                              RISC
Instruction set size and   Instruction set is very large     Instruction set is small and
instruction formats        and instruction format is         instruction format is fixed.
                           variable (16 - 64 bits per
                           instruction).
Addressing modes           12 - 24                           3 - 5
General purpose registers  8 - 24 general purpose            Most instructions are register
and cache design           registers. Unified cache is       based, so a large number of
                           used for instruction and data.    registers (32 - 192) are used,
                                                             and the cache is split into a
                                                             data cache and an instruction
                                                             cache.
CPI                        CPI is between 2 and 15.          In most cases CPI is 1, but
                                                             average CPI is less than 1.5.
CPU control                CPU is controlled by control      CPU is controlled by hardwired
                           memory (ROM) using                control, without control
                           microprograms.                    memory.
2. What are multiprocessor, multi-computer and multi-core systems?
[WBUT 2012, 2014]
Answer:
In a multiprocessor system there is more than one processor, and they work
simultaneously. In this system there is one master processor and the others are
slaves. If one slave processor fails, the master can assign its task to another slave
processor; but if the master fails, the entire system fails. The central part of a
multiprocessor is the master. All of the processors share the hard disk, memory and
other devices.
A multicomputer system consists of more than one computer, usually under the
supervision of a master computer, in which smaller computers handle input/output and
routine jobs while the large computer carries out the more complex computations.
A multi-core processor is a single computing component with two or more independent
actual central processing units (called "cores"), which are the units that read and
execute program instructions. The instructions are ordinary CPU instructions such as
add, move data, and branch, but the multiple cores can run multiple instructions at
the same time, increasing overall speed for programs amenable to parallel computing.
Manufacturers

typically integrate the cores onto a single integrated circuit die or onto multiple
dies in a single chip package.
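The claim above — that multiple cores speed up only programs amenable to parallel computing — is commonly quantified with Amdahl's law. The formula and the parallel-fraction numbers below are an illustrative addition, not from the text.

```python
# Amdahl's law: estimated speedup of a program on n cores when a
# fraction p of its work can run in parallel (illustrative values).

def amdahl_speedup(p: float, n: int) -> float:
    """Speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# A program that is 90% parallelizable never reaches a 10x speedup,
# no matter how many cores are added:
for cores in (2, 4, 64):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

Even with 64 cores the speedup stays below 1 / (1 - p) = 10, because the serial 10% of the work still runs on one core.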
3. What is Von-Neumann architecture? What is a Von-Neumann bottleneck? How
can this be reduced? [WBUT 2019]
Answer:
Von Neumann architecture was first published by John von Neumann in 1945. His
computer architecture design consists of a Control Unit, Arithmetic and Logic Unit
(ALU), Memory Unit, Registers and Inputs/Outputs. Von Neumann architecture is based
on the stored-program computer concept, where instruction data and program data are
stored in the same memory. This design is still used in most computers produced
today.
The Central Processing Unit (CPU) is the electronic circuit responsible for executing
the instructions of a computer program. It is sometimes referred to as the
microprocessor or processor. The CPU contains the ALU, CU and a variety of registers.
Registers are high speed storage areas in the CPU. All data must be stored in a
register before it can be processed. The Arithmetic and Logic Unit (ALU) allows
arithmetic (add, subtract etc.) and logic (AND, OR, NOT etc.) operations to be
carried out. The control unit controls the operation of the computer's ALU, memory
and input/output devices, telling them how to respond to the program instructions it
has just read and interpreted from the memory unit. The control unit also provides
the timing and control signals required by other computer components. Buses are the
means by which data is transmitted from one part of a computer to another, connecting
all major internal components to the CPU and memory. A standard CPU system bus is
comprised of a control bus, data bus and address bus.

Fig: Von Neumann architecture - the Central Processing Unit (Control Unit,
Arithmetic/Logic Unit, and Registers PC, CR, AC, MAR, MDR) connected to the Memory
Unit and to the Input and Output Devices

The CPU's processing speed is much faster in comparison to the main memory (RAM), and
as a result the CPU needs to wait longer to obtain a data-word from the memory. The
CPU and memory speed disparity is known as the Von Neumann bottleneck. This problem
can be solved in two ways:
1. Use of cache memory between the CPU and main memory
2. Using RISC computers
This performance problem can be reduced by introducing a cache memory (a special type
of fast memory) in between the CPU and the main memory. This works because the speed
of the cache memory is almost the same as that of the CPU, so there is no waiting
time for the CPU when instructions and data-words come to it for processing. Another
way of solving the problem is by using a special type of computer known as a Reduced
Instruction Set Computer (RISC). The main intention of the RISC is to reduce the
total number of memory references made by the CPU; instead it uses a large number of
registers for the same purpose.
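The benefit of the cache can be sketched with the standard average memory access time (AMAT) formula, which is not stated in the source; the latency and miss-rate numbers below are invented for illustration.

```python
# AMAT = hit time + miss rate x miss penalty: a small fast cache in
# front of slow main memory cuts the average access time the CPU sees.
# Timing numbers are made up for illustration.

def amat_ns(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time in nanoseconds."""
    return hit_time + miss_rate * miss_penalty

no_cache = 100.0                         # every access pays full RAM latency
with_cache = amat_ns(1.0, 0.05, 100.0)   # 1 ns cache that hits 95% of the time

print(no_cache)     # 100.0
print(with_cache)   # 6.0
```

With these assumed numbers the cache cuts the average access time from 100 ns to 6 ns, which is why it relieves the Von Neumann bottleneck.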
Long Answer Type Questions
1. a) What is SPEC rating? Explain. [WBUT 2015]
b) A 50 MHz processor was used to execute a program with the following
instruction mix and clock cycle counts:

Instruction type            Instruction count    Clock cycle count
Integer arithmetic          50000                1
Data transfer               35000                2
Floating point arithmetic   20000                2
Branch                      6000                 3
Calculate the effective CPI, MIPS rate and execution time for this program.
Answer:
a) The Standard Performance Evaluation Corporation (SPEC) is an American non-profit
organization that aims to "produce, establish, maintain and endorse a standardized set" of
performance benchmarks for computers. SPEC was founded in 1988. SPEC benchmarks
are widely used to evaluate the performance of computer systems; the test results are
published on the SPEC website. Results are sometimes informally referred to as
"SPECmarks" or just "SPEC". SPEC evolved into an umbrella organization
encompassing four diverse groups; Graphics and Workstation Performance Group
(GWPG), the High Performance Group (HPG), the Open Systems Group (OSG) and the
newest, the Research Group (RG).
b) Total instruction count = 111000
CPI = (50000 x 1 + 35000 x 2 + 20000 x 2 + 6000 x 3) / 111000 = 1.6
MIPS rate = clock frequency / (CPI x 1000000) = (50 x 1000000) / (1.6 x 1000000) = 31.25
Execution time = CPI x Instruction count x Clock cycle time
= 1.6 x 111000 x (1 / (50 x 1000000)) s = 0.00355 s = 3.55 ms
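The arithmetic above can be checked with a short script. The instruction mix and the 50 MHz clock come from the question; the variable names are my own.

```python
# Effective CPI, MIPS rate and execution time for the instruction mix
# in the question (50 MHz processor).

CLOCK_HZ = 50_000_000  # 50 MHz

# instruction type: (instruction count, clock cycles per instruction)
mix = {
    "integer arithmetic": (50_000, 1),
    "data transfer": (35_000, 2),
    "floating point arithmetic": (20_000, 2),
    "branch": (6_000, 3),
}

total_instructions = sum(c for c, _ in mix.values())       # 111000
total_cycles = sum(c * cyc for c, cyc in mix.values())     # 178000

cpi = total_cycles / total_instructions          # effective CPI, about 1.60
mips = CLOCK_HZ / (cpi * 1_000_000)              # about 31.2 (31.25 if CPI is
                                                 # first rounded to 1.6)
exec_time = cpi * total_instructions / CLOCK_HZ  # about 0.00356 s, i.e. 3.56 ms

print(total_instructions, round(cpi, 2), round(mips, 2), round(exec_time, 5))
```

Note the small discrepancy: working with the unrounded CPI gives 3.56 ms, while the worked answer rounds CPI to 1.6 first and gets 3.55 ms.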

2. a) Explain different types of addressing modes.
b) What are the advantages of Relative addressing mode over Direct addressing
mode? [WBUT 2017]

CA-124
COMPUTER ARCHITECTURE

Answer:
a) The term addressing modes refers to the way in which the operand of an instruction
is specified. Information contained in the instruction code is the value of the
operand or the address of the result/operand. Following are the main addressing modes
that are used on various platforms and architectures.
1) Immediate Mode: The operand is an immediate value that is stored explicitly in the
instruction.
2) Index Mode: The address of the operand is obtained by adding a constant value to
the contents of a general register (called the index register). The number of the
index register and the constant value are included in the instruction code.
3) Indirect Mode: The effective address of the operand is the contents of a register
or main memory location whose address appears in the instruction. Indirection is
noted by placing the name of the register or the memory address given in the
instruction in parentheses. The register or memory location that contains the address
of the operand is a pointer. When an execution takes place in such a mode, the
instruction may be told to go to a specific address.
4) Direct Mode: The address of the operand is embedded in the instruction code.
5) Register Mode: The name of the CPU register is embedded in the instruction. The
register contains the value of the operand. The number of bits used to specify the
register depends on the total number of registers in the processor set.
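The five modes above can be illustrated with a toy simulation. The memory and register contents are invented for illustration.

```python
# Toy simulation of the five addressing modes listed above. The memory
# and register contents are made up for illustration.

memory = {4000: 42, 5000: 4000}          # address -> stored word
registers = {"R1": 4000, "R2": 7}        # CPU registers
index_registers = {"X": 10}              # index registers

def immediate(value):
    """Immediate mode: the operand is the value in the instruction."""
    return value

def direct(address):
    """Direct mode: the instruction holds the operand's address."""
    return memory[address]

def indirect(address):
    """Indirect mode: the instruction holds the address of the address."""
    return memory[memory[address]]

def register_mode(name):
    """Register mode: the operand sits in the named CPU register."""
    return registers[name]

def index_mode(constant, reg):
    """Index mode: effective address = constant + index register."""
    return memory[constant + index_registers[reg]]

print(immediate(42))          # 42 (operand carried in the instruction)
print(direct(4000))           # 42 (memory[4000])
print(indirect(5000))         # 42 (memory[memory[5000]] = memory[4000])
print(register_mode("R2"))    # 7
print(index_mode(3990, "X"))  # 42 (memory[3990 + 10])
```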
is
b) Relative Addressing Mode: In this mode, the content of the program counter (PC) is
added to the address part of the instruction to obtain the effective address. When
the number is added to the content of the PC, the result is an effective address
whose position in memory is relative to the address of the next instruction.
Effective Address (EA) = PC + A
Direct Addressing Mode: In this mode, the address of the memory location which holds
the operand is included in the instruction. The operand resides in memory and its
address is given by the address field of the instruction.
For example: LDA 4000H
Fig: Direct addressing - the instruction (opcode + address field) points to the
operand in memory
Advantage: Simple, and provides more flexibility compared to immediate mode.
Disadvantage: Limited address field.
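The effective-address computation for the two modes can be sketched as follows; the addresses are illustrative, and the relocation point in the note is a commonly cited advantage added here, not stated in the source.

```python
# Effective-address computation for the two modes compared above
# (EA = PC + A for relative mode); addresses are illustrative.

def relative_ea(pc: int, offset: int) -> int:
    """Relative mode: EA = PC + A; the offset may be small and negative."""
    return pc + offset

def direct_ea(address_field: int) -> int:
    """Direct mode: the address field itself is the effective address."""
    return address_field

# Branching 8 words past the next instruction at PC = 0x2000:
print(hex(relative_ea(0x2000, 8)))   # needs only a short offset in the instruction
# The same target in direct mode needs the full address in the instruction:
print(hex(direct_ea(0x2008)))
```

This is the heart of part b): relative mode encodes only a short offset, so the address field can be small, and the code keeps working if it is relocated in memory.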
3. Write short notes on the following:
a) Power PC [WBUT 2007, 2010]
b) Non von Neumann architecture characteristics [WBUT 2012]
c) Cluster Computer [WBUT 2012, 2018]
d) RISC [WBUT 2019]


Answer:
a) Power PC:
The PowerPC microprocessor is a highly integrated single-chip processor that combines
a powerful RISC architecture, a superscalar machine organization, and a versatile
high performance bus interface. The processor contains a 32KB unified cache and is
capable of dispatching, executing, and completing up to 3 instructions per cycle. The
interface configurations provide a wide range of system bus interfaces, including
pipelined, non-pipelined, and split transactions. The result is a cost effective,
general purpose microprocessor solution that offers very competitive performance.
purpose
Fig: Power PC architecture - instruction queue and dispatch logic feeding the Branch
Unit, Fixed-Point Unit and Floating-Point Unit; a 32KB cache (tags + array), memory
management unit, memory queue and bus interface unit, with a 32-bit address bus and a
64-bit data bus

As shown in the above figure, it is a superscalar design with three pipelined
execution units. The processor can dispatch up to three 32-bit instructions each
cycle - one each to the Fixed-Point Unit (FXU), the Floating-Point Unit (FPU), and
the Branch Unit (BPU). The 32KB unified cache provides a 32-bit interface to the
FXU, a 64-bit interface to the FPU, and a 256-bit interface to both the instruction
queue and the memory queue. The chip I/Os include a 32-bit address bus and a 64-bit
data bus. The designers optimized the 601 pipeline structure for high performance and
concurrent instruction processing in each of the execution units, as shown below.
The fixed-point pipeline performs all integer arithmetic logic unit (ALU) operations
and all processor load and store instructions, including floating-point loads and
stores.
The branch instruction pipeline has only two stages. The first stage can dispatch,
decode, evaluate, and, if necessary, predict the direction of a branch instruction in
one cycle. On the next cycle, the resulting fetch can be accessing new instructions
from the cache.
The floating-point instruction pipeline contains six stages and has been optimized
for fully pipelined execution of single-precision operations.
Fig: PowerPC 601 pipeline architecture - Branch instructions: Fetch, then
Dispatch/Decode/Execute/Predict; Integer instructions: Fetch, Dispatch, Decode,
Execute, Writeback; Load/store instructions: Fetch, Dispatch, Decode, Address Gen.,
Cache, Writeback; Floating-point instructions: Fetch, Dispatch, Decode, Execute 1,
Execute 2, Writeback

b) Non von Neumann architecture characteristics:


Any computer architecture in which the underlying model of computation is different
from what has come to be called the standard von Neumann model. A non von Neumann
machine may thus be without the concept of sequential flow of control (i.e. without
any register corresponding to a "program counter" that indicates the current point
that has been reached in execution of a program) and/or without the concept of a
variable (i.e. without "named" storage locations in which a value may be stored and
subsequently referenced or changed).
Examples of non von Neumann machines are the dataflow machines and the reduction
machines. In both of these cases there is a high degree of parallelism, and instead
of variables there are immutable bindings between names and constant values.
Note that the term non von Neumann is usually reserved for machines that represent a
radical departure from the von Neumann model, and is therefore not normally applied
to multiprocessor or multicomputer architectures, which effectively offer a set of
cooperating von Neumann machines.
c) Cluster Computer:
A cluster computer consists of a set of loosely connected computers that work
together so that in many respects they can be viewed as a single system. The
components of a cluster are usually connected to each other through fast local area
networks, each node (computer used as a server) running its own instance of an
operating system. Computer clusters emerged as a result of the convergence of a
number of computing trends including the availability of low-cost microprocessors,
high speed networks, and software for high performance distributed computing.
Clusters are usually deployed to improve performance and availability over that of a
single computer, while typically being much more cost-effective than single computers
of comparable speed or availability. Computer
clusters have a wide range of applicability and deployment, ranging from small
business clusters with a handful of nodes to some of the fastest supercomputers in
the world. Computer clusters may be configured for different purposes, ranging from
general purpose business needs such as web-service support to computation-intensive
scientific calculations. In either case, the cluster may use a high-availability
approach. Note that the attributes described below are not exclusive, and a "compute
cluster" may also use a high-availability approach, etc. "Load-balancing" clusters
are configurations in which cluster nodes share computational workload to provide
better overall performance. For example, a web server cluster may assign different
queries to different nodes, so the overall response time will be optimized. However,
approaches to load-balancing may differ significantly among applications, e.g. a
high-performance cluster used for scientific computations would balance load with
different algorithms from a web-server cluster, which may just use a simple
round-robin method by assigning each new request to a different node.
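The round-robin dispatch just described can be sketched in a few lines; the node names are invented.

```python
# Minimal round-robin dispatcher of the kind described for web-server
# clusters: each new request goes to the next node in turn.

from itertools import cycle

nodes = ["node-a", "node-b", "node-c"]
_rotation = cycle(nodes)  # endless a, b, c, a, b, c, ...

def assign_request(request_id: int) -> str:
    """Return the node that should serve this request."""
    return next(_rotation)

assignments = [assign_request(i) for i in range(5)]
print(assignments)  # ['node-a', 'node-b', 'node-c', 'node-a', 'node-b']
```

A real load balancer would also track node health and queue depth; round-robin simply spreads requests evenly without looking at load.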

d) RISC:
RISC, or Reduced Instruction Set Computer, is a type of microprocessor architecture
that utilizes a small, highly-optimized set of instructions, rather than the more
specialized set of instructions often found in other types of architectures. The
first RISC projects came from IBM, Stanford, and UC-Berkeley in the late 70s and
early 80s. The IBM 801, Stanford MIPS, and Berkeley RISC 1 and 2 were all designed
with a similar philosophy which has become known as RISC. Certain design features
have been characteristic of most RISC processors:
One cycle execution time: RISC processors have a CPI (clock cycles per instruction)
of one cycle. This is due to the optimization of each instruction on the CPU and a
technique called pipelining.
Pipelining: a technique that allows for simultaneous execution of parts, or stages,
of instructions to more efficiently process instructions.
Large number of registers: the RISC design philosophy generally incorporates a
larger number of registers to prevent large amounts of interaction with memory.
Characteristics of RISC:
Simpler instructions, hence simple instruction decoding.
Instructions fit within one word.
Instructions take a single clock cycle to get executed.
More general purpose registers.
Simple addressing modes.
Fewer data types.
Pipelining can be achieved.
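The single-cycle-per-instruction behaviour of a pipelined RISC processor can be sketched with cycle counts; this is the idealised no-stall case, and the numbers are illustrative.

```python
# Cycle counts for n instructions on a k-stage pipeline, assuming one
# instruction enters per clock with no stalls - the idealised case behind
# the "single clock cycle per instruction" characteristic above.

def unpipelined_cycles(n: int, k: int) -> int:
    """Each instruction takes all k stage times in sequence."""
    return n * k

def pipelined_cycles(n: int, k: int) -> int:
    """k cycles to fill the pipeline, then one completion per cycle."""
    return k + (n - 1)

n, k = 1000, 5
print(unpipelined_cycles(n, k))   # 5000
print(pipelined_cycles(n, k))     # 1004
print(round(unpipelined_cycles(n, k) / pipelined_cycles(n, k), 2))  # 4.98
```

For large n the speedup approaches the stage count k, which is why a 5-stage RISC pipeline approaches an effective CPI of 1.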
