Practical ACA
LAB MANUAL
(MCSE-103)
NAME:
ENROLLMENT NUMBER:
TEACHER IN CHARGE: -
DECEMBER 2024
Index
S.No.   Experiment   Date   Remark   Sign
1       Study of different Classification Schemes

Experiment 1
Aim- Study of different Classification Schemes.
Flynn’s Classification-
This classification is based on the multiplicity of instruction streams and data streams in a computer system. The scheme was proposed by Michael J. Flynn.
Digital computers may be classified into four categories according to the multiplicity of instruction and data streams. The essential computing process is the execution of a sequence of instructions on a set of data. The term stream is used here to denote a sequence of items (instructions or data) as executed or operated upon by a single processor. Instructions and data are defined with respect to a reference machine: an instruction stream is a sequence of instructions as executed by the machine, and a data stream is a sequence of data, including input and partial or temporary results, called for by the instruction stream.
Flynn’s Organization-
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data stream (SIMD)
Multiple instruction stream, single data stream (MISD)
Multiple instruction stream, multiple data stream (MIMD)
Instructions and data are fetched from memory modules (MM). Instructions are decoded by the control unit (CU), which sends the decoded instruction stream to the processor unit (PU) for execution. Data streams flow between the processor and memory in both directions. Multiple memory modules may be used in the shared memory (SM) subsystem. In MIMD organizations, each instruction stream is generated by an independent control unit, and multiple data streams originate from the shared memory modules.
Feng’s Classification-
Feng classified computers by their degree of parallelism: plotted in a plane, the horizontal axis shows word length (n) and the vertical axis shows bit-slice length (m); the maximum degree of parallelism is the product m × n.
Handler’s Classification-
Each PCU corresponds to one processor or one CPU, and the ALU is equivalent to the processing elements (PEs). A computer system C can be characterized by a triple containing six independent entities, as below:
T(C) = <K × K', D × D', W × W'>
Where K is the number of processor control units (PCUs), K' the number of PCUs that can be pipelined, D the number of ALUs (PEs) under the control of one PCU, D' the number of ALUs that can be pipelined, W the word length of an ALU or PE, and W' the number of pipeline stages in all ALUs or PEs.
Shore classification-
Unlike Flynn’s ,shore classified the computers on the basis of organization of the constituent
elements in the computers Six different kind of machine were recognized-
Machine 1-
This is the conventional von Neumann architecture, with the following units in single quantities:
Control Unit (CU)
Processing Unit (PU)
Instruction Memory (IM)
Data Memory (DM)
A single DM read produces all the bits of any word for processing in parallel by the PU. The PU may contain multiple functional units.
Machine 2-
Similar to Machine 1, except that the DM fetches a bit slice from all the words in the memory, and the PU is organized to perform operations in a bit-serial manner on all the words.
Machine 3-
This is a combination of machines 1 and 2. It can be characterized as having a memory that is an array of bits, with both horizontal and vertical reading and processing possible.
Machine 4-
It is obtained by replicating the PU and DM of machine 1. An ensemble of a PU and a DM is called a processing element (PE). Instructions are issued to the PEs by a single control unit.
Machine 5-
It is similar to machine 4, with the addition of communication between the processing elements, e.g. ILLIAC IV.
Machine 6-
Machines 1 through 5 maintain a separation between data memory and processing unit, with some data bus or connection unit providing the communication between them. Machine 6 instead distributes the processing logic throughout the memory, and is known as a logic-in-memory array (LIMA).
Experiment 2
Aim- Study of Arithmetic Pipeline.
The complex arithmetic operations like multiplication and floating point operations consume
much of the time of the ALU. These operations can also be pipelined by segmenting the
operations of the ALU and as a consequence, high speed performance may be achieved. Thus,
the pipelines used for arithmetic operations are known as arithmetic pipelines.
The technique of pipelining can be applied to various complex and slow arithmetic operations
to speed up the processing time. The pipelines used for arithmetic computations are called
Arithmetic pipelines. In this section, we discuss arithmetic pipelines based on arithmetic
operations. Arithmetic pipelines are constructed for simple fixed-point and complex floating-
point arithmetic operations. These arithmetic operations are well suited to pipelining as these
operations can be efficiently partitioned into subtasks for the pipeline stages. For implementing
the arithmetic pipelines we generally use the following two types of adder:
1. Carry save adder (CSA)
2. Carry propagate adder (CPA)
We take the example of multiplication of fixed-point numbers. Two fixed-point numbers are multiplied by the ALU using repeated add and shift operations. This sequential execution makes multiplication a slow process. If we look at the multiplication process carefully, we observe that it is a process of adding multiple copies of shifted multiplicands; for two 6-bit numbers, there are six such rows.
Now, we can identify the following stages for the pipeline:
The first stage generates the partial products of the numbers, which form the six rows of shifted multiplicands.
In the second stage, the six numbers are given to two CSAs, merging them into four numbers.
In the third stage, a single CSA merges the four numbers into three numbers.
In the fourth stage, a single CSA merges the three numbers into two numbers.
In the fifth stage, the last two numbers are added through a CPA to get the final product.
These stages have been implemented using a CSA tree.
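The following is a minimal sketch in Python of the five stages above (the function names and the 6-bit operand width are our own, for illustration). A carry-save adder reduces three operands to a sum word and a carry word without propagating carries, so a CSA tree can reduce the six shifted multiplicands to two numbers for one final carry-propagate addition.

def csa(a, b, c):
    # Carry-save adder: a + b + c == s + cy, with no carry propagation.
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy

def pipelined_multiply(x, y, bits=6):
    # Stage 1: generate the six shifted multiplicands (partial products).
    pp = [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]
    # Stage 2: two CSAs merge the six numbers into four.
    s1, c1 = csa(pp[0], pp[1], pp[2])
    s2, c2 = csa(pp[3], pp[4], pp[5])
    # Stage 3: one CSA merges three of the four numbers, leaving three.
    s3, c3 = csa(s1, c1, s2)
    # Stage 4: one CSA merges the remaining three numbers into two.
    s4, c4 = csa(s3, c3, c2)
    # Stage 5: a carry-propagate addition produces the final product.
    return s4 + c4

assert pipelined_multiply(13, 45) == 13 * 45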
Experiment 3
Aim- Study of Instruction Pipeline.
The stream of instructions in the instruction execution cycle can be realized through a pipeline where overlapped execution of different operations is performed. The process of executing an instruction involves the following major steps:
Instruction fetch (IF) from the main memory
Instruction decode (ID) of the fetched instruction
Operand fetch (OF), if operands are needed for execution
Execution (EX) of the decoded instruction
Write-back (WB) of the result
(Figure A illustrates these stages.)
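As a small sketch of the overlap (plain Python, illustrative only), the following prints the space-time diagram of a five-stage pipeline: instruction i occupies stage s during cycle i + s, so n instructions complete in n + k - 1 cycles instead of n * k.

def space_time(n, stages=("IF", "ID", "OF", "EX", "WB")):
    k = len(stages)
    for i in range(n):
        row = ["--"] * (n + k - 1)
        for s in range(k):
            row[i + s] = stages[s]   # stage s of instruction i runs in cycle i+s
        print("I%d: %s" % (i + 1, " ".join(row)))

space_time(4)   # 4 instructions finish in 4 + 5 - 1 = 8 cycles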
Instruction buffers:
For taking full advantage of pipelining, pipelines should be filled continuously. Therefore, the instruction fetch rate should be matched with the pipeline consumption rate. To do this, instruction buffers are used. Instruction buffers in the CPU provide high-speed storage for instructions, which are pre-fetched into the buffer from the main memory. Another
alternative for the instruction buffer is the cache memory between the CPU and the main
memory. The advantage of cache memory is that it can be used for both instruction and data.
But cache requires more complex control logic than the instruction buffer. Some pipelined
computers have adopted both.
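A toy sketch of an instruction buffer in Python (the class and capacity are hypothetical): a fixed-capacity FIFO that the fetch unit fills ahead of time and the pipeline drains at one instruction per cycle.

from collections import deque

class InstructionBuffer:
    def __init__(self, capacity=8):
        self.buf = deque()
        self.capacity = capacity

    def prefetch(self, memory, pc):
        # Fill the buffer from main memory until full; return the new fetch PC.
        while len(self.buf) < self.capacity and pc < len(memory):
            self.buf.append(memory[pc])
            pc += 1
        return pc

    def issue(self):
        # The pipeline consumes one pre-fetched instruction per cycle.
        return self.buf.popleft() if self.buf else None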
Experiment 4
Aim- Study of Branch Handling in pipeline.
The performance of pipelined processors is limited by data dependences and branch instructions. The evaluation of branching strategies can be performed either for a specific pipeline architecture, using trace data, or by applying analytic models.
Effect of Branching:
Three basic terms are introduced below for the analysis of branching effects:
The action of fetching a non-sequential or remote instruction after a branch instruction is called a branch taken. The instruction to be executed after a branch taken is called the branch target. The number of pipeline cycles wasted between a branch taken and the fetching of its branch target is called the delay slot, denoted by b. In general, 0 ≤ b ≤ k−1, where k is the number of pipeline stages. When a branch taken occurs, all the instructions following the branch in the pipeline become useless and will be drained from the pipeline.
This implies that a branch taken causes the pipeline to be flushed, losing a number of useful cycles. These terms are illustrated in the figure, where a branch taken causes instructions I_{b+1} through I_{b+k-1} to be drained from the pipeline. Let p be the probability of a conditional branch instruction in a typical instruction stream and q the probability that a conditional branch is successfully executed (a branch taken). Typical values of p = 20% and q = 60% have been observed in some programs.
The penalty paid by branching is equal to p·q·n·b·τ, because each branch taken costs b extra pipeline cycles (τ being the clock period). The total time to execute n instructions, including the effect of branching, is thus:
T_eff = k·τ + (n − 1)·τ + p·q·n·b·τ
The above analysis implies that performance can be degraded by 46% with branching when the instruction stream is sufficiently long: with p = 20%, q = 60% and b = k − 1 = 7 for an eight-stage pipeline, the asymptotic degradation is pqb/(1 + pqb) = 0.84/1.84 ≈ 46%. This analysis demonstrates the degree of performance degradation caused by branching in an instruction pipeline.
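The formula above can be checked with a few lines of Python (τ is taken as one cycle; the variable names are ours):

def throughput(n, k, p=0.2, q=0.6, b=None):
    # Instructions per cycle for n instructions on a k-stage pipeline,
    # where a fraction p*q of instructions are taken branches costing b cycles.
    if b is None:
        b = k - 1                # worst-case delay slot
    cycles = k + (n - 1) + p * q * n * b
    return n / cycles

h = throughput(n=10**6, k=8)
print("degradation = %.0f%%" % (100 * (1 - h)))   # about 46%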
BRANCH PREDICTION:
Branches can be predicted either statically, based on branch instruction types, or dynamically, based on branch history during program execution. The probability of branching for each branch instruction type can be used to predict the branch direction. This requires collecting the frequency and probability of branch taken for each branch type across a large number of program traces. Such a static branch strategy may not always be accurate.
The static prediction direction (taken or not taken) is usually wired into the processor. The wired-in static prediction cannot be changed once committed to the hardware. However, the scheme can be modified to allow the programmer or compiler to select the direction of each branch on a semi-static prediction basis.
A dynamic branch strategy uses recent branch history to predict whether or not the branch will be taken the next time it occurs. To be accurate, one may need to use the entire history of the branch to predict the future choice. This is infeasible to implement. Therefore, most dynamic predictions are determined from recent history.
Cragon (1992) has classified dynamic branch strategies into three major classes:
One class predicts the branch direction based upon information found at the decode stage.
The second class uses a cache to store target addresses at the stage where the effective address of the branch target is computed.
The third scheme uses a cache to store target instructions at the fetch stage.
All dynamic predictions are adjusted dynamically as the program is executed.
Dynamic prediction demands additional hardware to keep track of the past behavior of branch instructions at run time. The amount of history recorded should be small; otherwise, the prediction logic becomes too costly to implement.
Lee and Smith (1984) have shown the use of a branch target buffer (BTB) to implement branch prediction. The BTB is used to hold recent branch information, including the address of the branch target used. The address of the branch instruction locates its entry in the BTB.
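A minimal sketch in Python (the structure is illustrative, not the exact Lee-Smith design) of a BTB keyed by branch address, with a 2-bit saturating counter recording recent history and the last target used:

class BranchTargetBuffer:
    def __init__(self):
        self.table = {}                      # branch PC -> (2-bit counter, target)

    def predict(self, pc):
        ctr, target = self.table.get(pc, (1, None))
        return ctr >= 2, target              # predict taken when counter is 2 or 3

    def update(self, pc, taken, target):
        ctr, _ = self.table.get(pc, (1, None))
        ctr = min(3, ctr + 1) if taken else max(0, ctr - 1)
        self.table[pc] = (ctr, target)       # record the recent branch history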
Experiment 5
Aim- Study of different types of hazards.
The basic requirement of any pipelined processor is that the various stages in the pipeline should work independently of each other. However, this may not always be the case. There may be some hindrances which disturb the smooth flow through the pipeline; such hindrances are called hazards in pipeline terminology. They are the bottlenecks of pipeline design. These hazards prevent the smooth flow of instructions through the pipeline and degrade its performance. There are 3 types of hazards:
DATA HAZARDS
CONTROL HAZARDS
STRUCTURAL HAZARDS
DATA HAZARDS-
Data hazards occur when instructions that exhibit data dependence modify data in different
stages of a pipeline.
Data Dependence-
Flow dependency
1. A = 3
2. B = A
3. C = B
Instruction 3 is truly dependent on instruction 2, as the final value of C depends on the
instruction updating B. Instruction 2 is truly dependent on instruction 1, as the final value of B
depends on the instruction updating A. Since instruction 3 is truly dependent upon instruction 2
and instruction 2 is truly dependent on instruction 1, instruction 3 is also truly dependent on
instruction 1.
Anti-dependency
1. B = 3
2. A = B + 1
3. B = 7
An anti-dependency, also known as write-after-read (WAR), occurs when an instruction requires a value that is later updated. Here, instruction 2 anti-depends on instruction 3: the ordering of these instructions cannot be changed, nor can they be executed in parallel, as that would affect the final value of A. An anti-dependency is an example of a name dependency, which can be removed by renaming variables, as in the modification below:
1. B = 3
N. B2 = B
2. A = B2 + 1
3. B = 7
A new variable, B2, has been declared as a copy of B in a new instruction, instruction N. The
anti-dependency between 2 and 3 has been removed, meaning that these instructions may
now be executed in parallel. However, the modification has introduced a new dependency:
instruction 2 is now truly dependent on instruction N, which is truly dependent upon
instruction 1.
Output dependency
An output dependency, also known as write-after-write (WAW), occurs when the ordering of
instructions will affect the final output value of a variable. In the example below, there is an
output dependency between instructions 3 and 1 — changing the ordering of instructions in
this example will change the final value of A, thus these instructions cannot be executed in
parallel.
1. B = 3
2. A = B + 1
3. B = 7
As with anti-dependencies, output dependencies are name dependencies. That is, they may be
removed through renaming of variables, as in the below modification of the above example:
1. B2 = 3
2. A = B2 + 1
3. B = 7
A commonly used naming convention for data dependencies is the following: Read-after-Write
or RAW (flow dependency), Write-after-Write or WAW (output dependency), and Write-After-
Read or WAR (anti-dependency).
Input Dependence-
An input dependence occurs when two statements read the same resource, not because one writes a variable that the other uses. A statement S2 is input dependent on S1 if and only if S1 and S2 read the same resource and S1 precedes S2 in execution. The following is an example of an input dependence (RAR: Read-After-Read):
S1 y := x + 3
S2 z := x + 5
Unknown Dependence-
The dependence relation between two statements cannot be determined in the following situations:
The subscript of a variable is itself subscripted (indirect addressing).
The subscript does not contain the loop index variable.
A variable appears more than once, with subscripts having different coefficients of the loop variable.
The subscript is nonlinear in the loop index variable.
When one or more of these conditions exist, a conservative assumption is to claim unknown dependence among the statements involved.
(i2 tries to read a source before i1 writes to it) A read after write (RAW) data hazard refers to a
situation where an instruction refers to a result that has not yet been calculated or retrieved.
This can occur because even though an instruction is executed after a previous instruction, the
previous instruction has not been completely processed through the pipeline.
For example:
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
The first instruction is calculating a value to be saved in register R2, and the second is going to
use this value to compute a result for register R4. However, in a pipeline, when we fetch the
operands for the 2nd operation, the results from the first will not yet have been saved, and
hence we have a data dependency.
We say that there is a data dependency with instruction i2, as it is dependent on the completion
of instruction i1.
(i2 tries to write a destination before it is read by i1) A write after read (WAR) data hazard
represents a problem with concurrent execution.
For example:
i1. R4 <- R1 + R5
i2. R5 <- R1 + R2
If we are in a situation that there is a chance that i2 may be completed before i1 (i.e. with
concurrent execution) we must ensure that we do not store the result of register R5 before i1
has had a chance to fetch the operands.
(i2 tries to write an operand before it is written by i1) A write after write (WAW) data hazard may occur in a concurrent execution environment.
For example:
i1. R2 <- R4 + R7
i2. R2 <- R1 + R3
The write-back of i2 must be delayed until i1 has finished executing; otherwise R2 could end up holding the stale result of i1.
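The three hazard types can be detected mechanically from the read and write sets of two instructions (i1 before i2). A short sketch in Python (the register-name sets are chosen for illustration):

def classify(i1_reads, i1_writes, i2_reads, i2_writes):
    hazards = []
    if i1_writes & i2_reads:
        hazards.append("RAW")    # i2 reads a value i1 has not yet written
    if i1_reads & i2_writes:
        hazards.append("WAR")    # i2 may overwrite an operand i1 still needs
    if i1_writes & i2_writes:
        hazards.append("WAW")    # both write the same destination
    return hazards

# i1: R2 <- R1 + R3 and i2: R4 <- R2 + R3 give ['RAW']
print(classify({"R1", "R3"}, {"R2"}, {"R2", "R3"}, {"R4"}))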
Branching hazards (also known as control hazards) occur with branches. On many instruction pipeline microarchitectures, the processor will not know the outcome of the branch when it needs to insert a new instruction into the pipeline. These types of hazards arise due to control dependence between various stages of the pipeline. Control dependence refers to a situation where the order of execution of statements cannot be determined before run time. For example, a conditional branch will not be resolved until run time. Different paths taken after a conditional branch may introduce or eliminate data dependences among instructions. Dependence may also exist between operations performed in successive iterations of a loop.
Control hazards often prohibit parallelism from being exploited; compiler techniques are needed to get around control hazards in order to exploit more parallelism.
Structural hazards
A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A canonical example is a single memory unit that is accessed both in the fetch stage, where an instruction is retrieved from memory, and in the memory stage, where data is written and/or read from memory. Structural hazards can often be resolved by separating the component into orthogonal units (such as separate caches) or by bubbling the pipeline.
Experiment 6
Aim- Implement a parallel algorithm for an array processor (SIMD).
The original motivation for developing SIMD array processors was to perform parallel computation on vector or matrix types of data. Parallel processing algorithms have been developed by many computer scientists for SIMD computers. Important SIMD algorithms can be used to perform matrix multiplication, fast Fourier transform (FFT), matrix transposition, summation of vector elements, matrix inversion, parallel sorting, linear recurrence and Boolean matrix operations, and to solve partial differential equations.
Matrix multiplication is frequently needed in solving linear systems of equations. The differences between SISD and SIMD matrix algorithms show up in their program structure and speed performance. In general, the inner loops of a multilevel SISD program can be replaced by one or more SIMD vector instructions.
Let A = [a_ik] and B = [b_kj] be two n×n matrices. The multiplication of A and B generates a product matrix C = A×B = [c_ij] of dimension n×n. The elements of the product matrix C are related to the elements of A and B by:
c_ij = Σ_{k=1..n} a_ik × b_kj    ...(1)
There are n^3 cumulative multiplications to be performed in Eq. (1). In a conventional SISD uniprocessor system, the cumulative multiplications are carried out by a serially coded program:
For i = 1 to n do
For j = 1 to n do
C_ij = 0 (initialization)
For k = 1 to n do
C_ij = C_ij + A_ik * B_kj (additive multiplication)
End of k loop
End of j loop
End of i loop
Matrix multiplication on an SIMD computer with n PEs: the structure of the algorithm depends heavily on the memory allocation of the A, B and C matrices in the PE memories (PEMs). If we store each row vector of a matrix across the PEMs, and each column vector within the same PEM, this memory allocation scheme allows parallel access to an entire row of matrix elements. In the algorithm, two parallel do operations correspond to a vector load for initialization and a vector multiply for the inner loop of additive multiplications. The time complexity is thereby reduced to O(n^2); i.e., the SIMD algorithm is n times faster than the SISD algorithm for matrix multiplication:
For i = 1 to n do
Par for k = 1 to n do
C_ik = 0 (vector load)
End of k loop
For j = 1 to n do
Par for k = 1 to n do
C_ik = C_ik + A_ij * B_jk (vector multiply)
End of k loop
End of j loop
End of i loop
It should be noted that the vector load operation is performed to initialize one row vector of C at a time. In the vector multiply operation, the same multiplier a_ij is broadcast from the CU to all PEs to multiply all n elements of a row vector of B.
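A sketch of both algorithms in Python, with NumPy's vector operations standing in for the n PEs (an assumption for illustration; a real SIMD machine would broadcast a_ij from the CU in hardware):

import numpy as np

def matmul_sisd(A, B):
    # Serial triple loop: n^3 scalar multiply-adds.
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_simd(A, B):
    # The inner k-loop becomes one vector operation over all n PEs,
    # so only n^2 (broadcast) steps remain.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            C[i, :] += A[i, j] * B[j, :]   # broadcast a_ij to all PEs
    return C

A = np.arange(9.0).reshape(3, 3)
B = np.eye(3)
assert np.allclose(matmul_simd(A, B), matmul_sisd(A, B))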
The following steps are from a parallel sorting procedure on a mesh-connected SIMD array (t_R denotes a routing step and t_C a comparison step):
J1: Move all odd columns to the left and all even columns to the right, in 2t_R time.
J2: Use the odd-even transposition sort to sort each column.
J3: Interchange on each row, in 2t_R time.
M6: Compare-interchange adjacent elements (every even with the next odd), in 4t_R + t_C time.
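Step J2's odd-even transposition sort is easy to sketch in Python: in alternating phases, every even-indexed (then odd-indexed) element is compare-interchanged with its right neighbour, and n phases suffice for n elements on a linear array (on the array processor, all pairs in a phase would run in parallel).

def odd_even_transposition_sort(a):
    n = len(a)
    for phase in range(n):
        start = phase % 2                    # even pairs, then odd pairs
        for i in range(start, n - 1, 2):     # all pairs in a phase are independent
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]   # compare-interchange
    return a

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))   # [1, 2, 3, 4, 5]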
Experiment 7
Aim- Implement a search algorithm.
Associative search algorithms:
Associative memories are basically used for the fast search and ordered retrieval of large files of records. Many researchers have suggested using associative memories for implementing relational database machines. Every relational database, or relation of records, can be arranged in tabular form. The tabulated records (relations) can be programmed into the cells of an associative memory. Various associative search operations have been divided into the following categories by T. Y. Feng (1976).
Extreme Searches
The maxima: The largest among a collection of records is searched for.
The minima: The smallest among a collection of records is searched for.
The median: Search for the median according to a certain ordering.
Equivalence Searches
Equal-to: An exact match is searched for under a certain equality relation.
Not-equal-to: Search for those elements not equal to the given key.
Similar-to: Search for a match within the masked field.
Proximate-to: Search for those records that satisfy a certain proximity (neighbourhood) condition.
Threshold Searches
Smaller-than: Search for those records that are strictly smaller than the given key.
Greater-than: Search for those records that are strictly greater than the given key.
Not-smaller-than: Search for those records that are equal to or greater than the given key.
Not-greater-than: Search for those records that are equal to or smaller than the given key.
Adjacency Searches
Near below: Search for the nearest record which is smaller than the key.
Near above: Search for the nearest record which is greater than the key.
[X,Y]: Search for those records within the closed range {Z | X ≤ Z ≤ Y}.
(X,Y): Search for those records within the open range {Z | X < Z < Y}.
[X,Y): Search for those records within the range {Z | X ≤ Z < Y}.
(X,Y]: Search for those records within the range {Z | X < Z ≤ Y}.
Ordered retrievals
Listed above are primitive search operations. Of course, one can always combine a sequence of primitive search operations with Boolean operators to form various query conjunctions. For example, one may wish to answer queries such as: equal to A but not equal to B; the second largest from below; outside the range; etc. The Boolean operators AND, OR and NOT can be used to form any query conjunction of predicates. A predicate consists of one of the above relational operators plus an operand, such as the pairs (≤, A) or (≠, A). The above search operations are frequently used in text retrieval applications.
Example: The minima search algorithm searches for the smallest number among a set of n positive numbers stored in a bit-serial AM array. Every number has f bits, stored in a field of a word from bit position s to bit position s+f−1.
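A sketch of the bit-serial minima search in Python (the word layout is simplified to plain integers, which is our assumption): scanning the field from its most significant bit, words with a 0 in the current bit slice stay candidates whenever at least one such word exists; after f slices only the minima remain.

def minima_search(words, f):
    candidates = list(range(len(words)))     # all words start as candidates
    for bit in range(f - 1, -1, -1):         # most significant bit first
        zeros = [i for i in candidates if not (words[i] >> bit) & 1]
        if zeros:                            # a 0 in this slice means "smaller"
            candidates = zeros
    return candidates                        # indices of the smallest value(s)

vals = [13, 7, 21, 7, 30]
print([vals[i] for i in minima_search(vals, 5)])   # [7, 7]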
Experiment 8
Aim- Study of scheduling and load balancing in distributed systems.
Scheduling policies may be static or dynamic. Static algorithms make their placement decisions from compile-time information alone. Dynamic algorithms, on the other hand, take factors into account such as the current load on each processor. Adaptive algorithms are a special subclass of dynamic algorithms.
In physically non-distributed, or centralized, scheduling policies, a single processor makes all decisions regarding task placement; this has obvious implications for the autonomy of the participating systems.
In physically distributed algorithms, the logical authority for the decision-making process is distributed among the processors that constitute the system.
Under non-cooperative distributed scheduling policies, individual processors make scheduling choices independent of the choices made by other processors. With cooperative scheduling, the processors subordinate local autonomy to the achievement of a common goal. Both static and cooperative distributed scheduling have optimal and sub-optimal branches.
Optimal assignments can be reached if complete information describing the system and the
task force is available.
Sub-optimal algorithms are either approximate or heuristic. Heuristic algorithms use guiding principles, such as assigning tasks with heavy inter-task communication to the same processor, or placing large jobs first.
Approximate solutions use the same computational methods as optimal solutions, but accept solutions that are within an acceptable range according to an algorithm-dependent metric.
Optimal and approximate solutions typically employ methods such as:
Mathematical programming
Queuing theory
In practice, a load balancing algorithm is built from three rules:
Location rule
Distribution rule
Selection rule
The selection rule works either in a pre-emptive or in a non-pre-emptive fashion. A newly generated process is always picked up by the non-pre-emptive rule, while a running process may be picked up by the pre-emptive rule. Pre-emptive transfer is costlier than non-pre-emptive transfer, which is therefore generally preferred. However, pre-emptive transfer performs better in some instances.
Practically, load balancing decisions are taken jointly by the location and distribution rules. Balancing domains are of 2 types: local and global. In a local domain, the balancing decision is taken within a group of nearest neighbours by exchanging local workload information, while in a global domain the balancing decision is taken by triggering transfers between partners across the whole system, exchanging workload information globally.
Load balancing improves the performance of each node and hence the overall system performance. Its further benefits include:
Load balancing reduces job idle time.
Small jobs do not suffer from long starvation.
Maximum utilization of resources.
Response times become shorter.
Higher throughput.
Higher reliability.
Low cost but high gain.
Extensibility and incremental growth.
For the above benefits, load balancing has become a field of intensive research.
The selection of a load balancing algorithm depends on application parameters, like balancing quality and load generation pattern, and also on system parameters, like communication overhead. Generally, load balancing algorithms are of 2 types:
Static Load Balancing: In a static algorithm, processes are assigned to processors at compile time according to the performance of the nodes. Once the processes are assigned, no change or re-assignment is possible at run time. The number of jobs at each node is fixed in a static load balancing algorithm, and static algorithms do not collect any information about the nodes. The assignment of jobs to processing nodes is done on the basis of the following factors: incoming time, extent of resources needed, mean execution time and inter-process communication. Static load balancing algorithms are also called probabilistic algorithms. They are divided into 2 subclasses:
Optimal static load balancing: If all the information and resources related to a system are known, optimal static load balancing can be done. An optimal load balancing algorithm makes it possible to increase the throughput of a system and to maximize the use of its resources.
Sub-optimal static load balancing: A sub-optimal load balancing algorithm becomes necessary for some applications when an optimal solution cannot be found.
Dynamic Load Balancing:
With static load balancing, too much information about the system and jobs must be known before execution. This information may not be available in advance, which is why dynamic load balancing (DLB) algorithms came into existence. The assignment of jobs is done at run time. In DLB, jobs are re-assigned at run time depending upon the situation; that is, load is transferred from heavily loaded nodes to lightly loaded nodes. In this case, communication overheads occur, and they grow as the number of processors increases. In DLB, no decision is taken until the process begins execution. The strategy collects information about the system state and about the jobs.
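A toy sketch of this strategy in Python (the threshold and node counts are hypothetical): jobs migrate one at a time from the most heavily loaded node to the most lightly loaded node until the imbalance is within a threshold.

def dynamic_balance(loads, threshold=1):
    # loads[i] = number of jobs currently queued at node i.
    moves = []
    while max(loads) - min(loads) > threshold:
        src = loads.index(max(loads))        # heavily loaded node
        dst = loads.index(min(loads))        # lightly loaded node
        loads[src] -= 1                      # transfer one job
        loads[dst] += 1
        moves.append((src, dst))
    return loads, moves

print(dynamic_balance([9, 2, 4, 1]))         # balances to [4, 4, 4, 4]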
Experiment 10
Aim- Study of deadlock avoidance in distributed systems.
Deadlock-
Deadlock is a condition in a system where a process cannot proceed because it needs to obtain a resource held by another process while it itself is holding a resource that the other process needs. In a system of processes which communicate only with a single central agent, deadlock can be detected easily, because the central agent has complete information about every process. Deadlock is a fundamental problem in distributed systems.
Types of deadlock-
Communication Deadlock-
It occurs when process A is trying to send a message to process B, which is trying to send a message to process C, which is trying to send a message to process A.
Resource Deadlock-
It occurs when processes are trying to get exclusive access to devices, files, locks, servers or other resources.
Circular Wait-
Two or more processes are waiting for resources held by one of the other processes.
Recovery-
Process Termination-
One or more processes involved in the deadlock may be aborted.
Resource Preemption-
Resources allocated to various processes may be successively preempted and allocated to other processes until the deadlock is broken.
Prevention-
Deadlock prevention works by preventing one of the four Coffman conditions from occurring.
o Removing the mutual exclusion condition means that no process will have exclusive access to a resource.
o The hold and wait or resource holding condition may be removed by requiring processes to request all the resources they will need before starting up.
o The no-preemption condition may also be difficult or impossible to avoid.
o Circular wait: approaches that avoid circular waits include disabling interrupts during critical sections and using a hierarchy to determine a partial ordering of resources.
Avoidance-
Deadlock can be avoided if certain information about processes is available to the OS before the allocation of resources, such as which resources a process will consume in its lifetime. For every resource request, the system checks whether granting the request would cause the system to enter an unsafe state, meaning a state that could result in deadlock. The system then only grants requests that lead to safe states. In order for the system to be able to determine whether the next state will be safe or unsafe, it must know in advance, at any time, the number and type of all resources in existence, available and requested.
One well-known algorithm used for deadlock avoidance is the Banker's algorithm, which requires resource usage limits to be known in advance. However, for many systems it is impossible to know in advance what every process will require. This means that deadlock avoidance is often impossible.
Two other algorithms are wait/die and wound/wait, each of which uses a symmetry-breaking technique.
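A minimal sketch in Python of the safety check at the heart of the Banker's algorithm (the data layout is our own): a state is safe if some order lets every process acquire its remaining need from what is currently available and then release everything it holds.

def is_safe(available, max_need, allocation):
    # need[i] = what process i may still request.
    need = [[m - a for m, a in zip(mx, al)] for mx, al in zip(max_need, allocation)]
    work = list(available)
    finished = [False] * len(allocation)
    progress = True
    while progress:
        progress = False
        for i in range(len(allocation)):
            if not finished[i] and all(n <= w for n, w in zip(need[i], work)):
                # Process i can run to completion and release its resources.
                work = [w + a for w, a in zip(work, allocation[i])]
                finished[i] = True
                progress = True
    return all(finished)

# 3 processes, 2 resource types: this state is safe.
print(is_safe([2, 2], [[4, 2], [3, 2], [5, 1]], [[1, 1], [2, 0], [3, 0]]))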