ACA Notes TechJourney PDF
MODULE-1
Computers have gone through two major stages of development: mechanical and electronic. Prior
to 1945, computers were made with mechanical or electromechanical parts.
The earliest mechanical computer can be traced back to 500 BC in the form of the abacus used in
China.
The abacus is manually operated to perform decimal arithmetic with carry propagation digit by
digit.
Blaise Pascal built a mechanical adder/subtractor in France in 1642. Charles Babbage designed a
difference engine in England for polynomial evaluation in 1827.
Konrad Zuse built the first binary mechanical computer in Germany in 1941. Howard Aiken
proposed the very first electromechanical decimal computer, which was built as the Harvard Mark
I by IBM in 1944.
Both Zuse's and Aiken's machines were designed for general-purpose computations.
Computer Generations
Over the past five decades, electronic computers have gone through five generations of
development. Each of the first three generations lasted about 10 years.
The fourth generation covered a time span of 15 years.
We have just entered the fifth generation with the use of processors and memory devices with more
than 1 million transistors on a single silicon chip.
- Computing Problems
The use of a computer is driven by real-life problems demanding fast and accurate solutions.
Depending on the nature of the problems, the solutions may require different computing resources.
For numerical problems in science and technology, the solutions demand complex mathematical
formulations and tedious integer or floating-point computations.
For alpha numerical problems in business and government, the solutions demand accurate
transactions, large database management, and information retrieval operations.
For artificial intelligence (AI) problems, the solutions demand logic inferences and symbolic
manipulations.
These computing problems have been labeled numerical computing, transaction processing, and
logical reasoning.
Some complex problems may demand a combination of these processing modes.
- Hardware Resources
A modern computer system demonstrates its power through coordinated efforts by hardware
resources, an operating system, and application software.
Processors, memory, and peripheral devices form the hardware core of a computer system.
Special hardware interfaces are often built into I/O devices, such as terminals, workstations, optical
page scanners, magnetic ink character recognizers, modems, file servers, voice data entry, printers,
and plotters.
These peripherals are connected to mainframe computers directly or through local or wide-area
networks.
- Operating System
An effective operating system manages the allocation and deallocation of resources during the
execution of user programs.
Beyond the OS, application software must be developed to benefit the users.
Standard benchmark programs are needed for performance evaluation.
Mapping is a bidirectional process matching algorithmic structure with hardware architecture, and
vice versa.
Efficient mapping will benefit the programmer and produce better source codes.
The mapping of algorithmic and data structures onto the machine architecture includes processor
scheduling, memory maps, interprocessor communications, etc.
These activities are usually architecture-dependent.
- Compiler Support
There are three compiler upgrade approaches:
Preprocessor: A preprocessor uses a sequential compiler and a low-level library of the target
computer to implement high-level parallel constructs.
Precompiler: The precompiler approach requires some program flow analysis, dependence
checking, and limited optimizations toward parallelism detection.
Parallelizing Compiler: This approach demands a fully developed parallelizing or vectorizing
compiler which can automatically detect parallelism in source code and transform sequential codes
into parallel constructs.
Therefore, the study of architecture covers both instruction-set architectures and machine
implementation organizations.
Flynn's Classification
Michael Flynn (1972) introduced a classification of various computer architectures based on notions of
instruction and data streams.
1. SISD (Single Instruction stream over a Single Data stream) computers
2. SIMD (Single Instruction stream over Multiple Data streams) machines
3. MIMD (Multiple Instruction streams over Multiple Data streams) machines.
4. MISD (Multiple Instruction streams and a Single Data stream) machines
Development Layers
Hardware configurations differ from machine to machine, even those of the same model.
Address Space of a processor in a computer system varies among different architectures. It
depends on the memory organization, which is machine-dependent. These features are up to the
designer and should match the target application domains.
We want to develop Application Programs and Programming Environments which are
machine-independent. Independent of machine architecture, the user programs can be ported to
many computers with minimum conversion costs.
High-level languages and Communication Models depend on the architectural choices made in a
computer system. From a programmer's viewpoint, these two layers should be architecture-
transparent.
At present, Fortran, C, Pascal, Ada, and Lisp are supported by most computers.
However, the communication models, shared variables versus message passing, are mostly
machine-dependent. The Linda approach using tuple spaces offers an architecture-transparent
communication model for parallel computers.
Application programmers prefer more architectural transparency. However, kernel programmers
have to explore the opportunities supported by hardware.
As a good computer architect, one has to approach the problem from both ends.
The compilers and OS support should be designed to remove as many architectural constraints as
possible from the programmer.
• An accurate estimate of the average CPI requires a large amount of program code to be traced over
a long period of time.
• Unless specifically focusing on a single instruction type, we simply use the term CPI to mean the
average value with respect to a given instruction set and a given program mix.
Performance Factors
• Let Ic be the number of instructions in a given program, or the instruction count.
• The CPU time (T in seconds/program) needed to execute the program is estimated by finding the
product of three contributing factors:
T = Ic x CPI x τ (1.1)
• The execution of an instruction requires going through a cycle of events involving the instruction
fetch, decode, operand(s) fetch, execution, and store results.
• In this cycle, only the instruction decode and execution phases are carried out in the CPU.
• The remaining three operations may be required to access the memory. We define a memory cycle
as the time needed to complete one memory reference.
• Usually, a memory cycle is k times the processor cycle τ.
• The value of k depends on the speed of the memory technology and processor-memory
interconnection scheme used.
• The CPI of an instruction type can be divided into two component terms corresponding to the total
processor cycles and memory cycles needed to complete the execution of the instruction.
• Depending on the instruction type, the complete instruction cycle may involve one to four memory
references (one for instruction fetch, two for operand fetch, and one for store results). Therefore we
can rewrite Eq. 1.1 as follows:
T = Ic x (p + m x k) x τ (1.2)
where p is the number of processor cycles needed for the instruction decode and execution,
m is the number of memory references needed,
k is the ratio between memory cycle and processor cycle,
Ic is the instruction count,
τ is the processor cycle time.
Equation 1.2 can be further refined once the CPI components (p, m, k) are weighted over the entire
instruction set.
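As a concrete illustration of Eq. 1.2, the short C sketch below plugs in assumed values for Ic, p, m, k, and τ (the numbers are purely illustrative, not taken from the text) and prints the resulting CPU time.

/* Illustrative sketch of Eq. 1.2: T = Ic * (p + m*k) * tau.
   All numeric values below are assumed for demonstration only. */
#include <stdio.h>

int main(void)
{
    double Ic  = 2.0e6;    /* instruction count                          */
    double p   = 4.0;      /* processor cycles per instruction           */
    double m   = 1.5;      /* average memory references per instruction  */
    double k   = 3.0;      /* memory cycle / processor cycle ratio       */
    double tau = 25.0e-9;  /* processor cycle time (25 ns => 40 MHz)     */

    double T = Ic * (p + m * k) * tau;   /* Eq. 1.2 */
    printf("CPU time T = %.4f s\n", T);
    return 0;
}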
System Attributes
• The above five performance factors (Ic, p, m, k, τ) are influenced by four system attributes:
instruction-set architecture, compiler technology, CPU implementation and control, and cache and
memory hierarchy, as specified in Table 1.2.
• The instruction-set architecture affects the program length (Ic) and processor cycle needed (p). The
compiler technology affects the values of Ic, p and the memory reference count (m).
• The CPU implementation and control determine the total processor time (p x τ) needed.
• Finally, the memory technology and hierarchy design affect the memory access latency (k x τ). The
above CPU time can be used as a basis in estimating the execution rate of a processor.
MIPS Rate
• Let C be the total number of clock cycles needed to execute a given program.
• Then the CPU time in Eq. 1.2 can be estimated as T = C x τ = C/f.
• Furthermore, CPI = C/Ic and T = Ic x CPI x τ = Ic x CPI/f. The processor speed is often measured in
terms of million instructions per second (MIPS).
• We simply call it the MIPS rate of a given processor. It should be emphasized that the MIPS rate
varies with respect to a number of factors, including the clock rate (f), the instruction count (Ic),
and the CPI of a given machine, as defined below:
MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) (1.3)
• Based on Eq. 1.3, the CPU time in Eq. 1.2 can also be written as T = Ic x 10^-6 / MIPS.
• Based on the system attributes identified in Table 1.2 and the above derived expressions, we
conclude by indicating the fact that the MIPS rate of a given computer is directly proportional to
the clock rate and inversely proportional to the CPI.
• All four system attributes, instruction set, compiler, processor, and memory technologies, affect the
MIPS rate, which varies also from program to program.
Throughput Rate
• Number of programs a system can execute per unit time, called the system throughput Ws (in
programs/second).
• In a multiprogrammed system, the system throughput is often lower than the CPU throughput Wp
defined by Wp = f / (Ic x CPI).
• Note that Wp = (MIPS) x 10^6 / Ic from Eq. 1.3. The unit for Wp is programs/second.
• The CPU throughput is a measure of how many programs can be executed per second, based on the
MIPS rate and average program length (Ic).
• The reason why Ws < Wp is due to the additional system overheads caused by the I/O, compiler,
and OS when multiple programs are interleaved for CPU execution by multiprogramming or
timesharing operations.
• If the CPU is kept busy in a perfect program-interleaving fashion, then Ws = Wp. This will
probably never happen, since the system overhead often causes an extra delay and the CPU may be
left idle for some cycles.
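The relations among the clock rate f, CPI, MIPS rate, CPU time, and CPU throughput Wp can be checked with the following C sketch; the values of f, CPI, and Ic are assumed for illustration only.

/* Sketch relating f, CPI, MIPS rate, CPU time T and CPU throughput Wp.
   Assumed values, for illustration only. */
#include <stdio.h>

int main(void)
{
    double f   = 40.0e6;   /* clock rate in Hz               */
    double CPI = 8.5;      /* average cycles per instruction */
    double Ic  = 2.0e6;    /* instructions in the program    */

    double mips = f / (CPI * 1.0e6);     /* Eq. 1.3                  */
    double T    = Ic * 1.0e-6 / mips;    /* CPU time in seconds      */
    double Wp   = mips * 1.0e6 / Ic;     /* CPU throughput (prog/s)  */

    printf("MIPS = %.2f, T = %.4f s, Wp = %.3f programs/s\n", mips, T, Wp);
    return 0;
}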
Programming Environments
Implicit Parallelism
An implicit approach uses a conventional language, such as C, Fortran, Lisp, or Pascal, to write the
source program.
The sequentially coded source program is translated into parallel object code by a parallelizing
compiler.
The compiler must be able to detect parallelism and assign target machine resources. This compiler
approach has been applied in programming shared-memory multiprocessors.
With parallelism being implicit, success relies heavily on the "intelligence" of a parallelizing
compiler.
This approach requires less effort on the part of the programmer.
Explicit Parallelism
The second approach (Fig. 1.5b) requires more effort by the programmer to develop a source
program using parallel dialects of C, Fortran, Lisp, or Pascal.
Parallelism is explicitly specified in the user programs.
This will significantly reduce the burden on the compiler to detect parallelism.
Instead, the compiler needs to preserve parallelism and, where possible, assign target machine
resources.
1. Shared-Memory Multiprocessors
There are 3 shared-memory multiprocessor models:
i. Uniform Memory Access (UMA) model,
ii. Non-Uniform Memory Access (NUMA) model
iii. Cache-Only Memory Architecture (COMA) model.
These models differ in how the memory and peripheral resources are shared or distributed.
In a UMA multiprocessor model (Fig. 1.6), the physical memory is uniformly shared by all the
processors.
All processors have equal access time to all memory words, which is why it is called uniform
memory access.
Each processor may use a private cache. Peripherals are also shared in some fashion.
Multiprocessors are called tightly coupled systems due to the high degree of resource sharing. The
system interconnect takes the form of a common bus, a crossbar switch, or a multistage network.
Most computer manufacturers have multiprocessor (MP) extensions of their uniprocessor
(UP) product line.
The UMA model is suitable for general-purpose and timesharing applications by multiple users. It
can be used to speed up the execution of a single large program in time-critical applications. To
coordinate parallel events, synchronization and communication among processors are done through
shared variables in the common memory.
When all processors have equal access to all peripheral devices, the system is called a symmetric
multiprocessor. In this case, all the processors are equally capable of running the executive
programs, such as the OS kernel and I/O service routines.
A NUMA multiprocessor is a shared-memory system in which the access time varies with the
location of the memory word.
Two NUMA machine models are depicted in Fig. 1.7.
The shared memory is physically distributed to all processors, called local memories.
The collection of all local memories forms a global address space accessible by all processors.
It is faster to access a local memory with a local processor. The access of remote memory attached
to other processors takes longer due to the added delay through the interconnection network.
The BBN TC-2000 Butterfly multiprocessor assumes the configuration shown in Fig. 1.7a.
2. Distributed-Memory Multicomputers
SIMD computers have a single instruction stream over multiple data streams.
An operational model of an SIMD computer is specified by a 5-tuple:
M = (N, C, I, M, R)
where
1. N is the number of processing elements (PEs) in the machine. For example, the Illiac IV had 64
PEs and the Connection Machine CM-2 had 65,536 PEs.
2. C is the set of instructions directly executed by the control unit (CU), including scalar and
program flow control instructions.
3. I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include
arithmetic, logic, data routing, masking, and other local operations executed by each active PE
over data within that PE.
4. M is the set of masking schemes, where each mask partitions the set of PEs into enabled and
disabled subsets.
5. R is the set of data-routing functions, specifying various patterns to be set up in the interconnection
network for inter-PE communications.
PRAM is a theoretical model of parallel computation in which an arbitrary but finite number of
processors can access any value in an arbitrarily large shared memory in a single time step.
Processors may execute different instruction streams, but work synchronously. This model assumes
a shared-memory multiprocessor machine, as shown:
The machine size n can be arbitrarily large
The machine is synchronous at the instruction level. That is, each processor is executing its own
series of instructions, and the entire machine operates at a basic time step (cycle). Within each
cycle, each processor executes exactly one operation or does nothing, i.e. it is idle.
An instruction can be any random access machine instruction, such as: fetch some operands from
memory, perform an ALU operation on the data, and store the result back in memory.
All processors implicitly synchronize on each cycle and the synchronization overhead is assumed
to be zero.
Communication is done through reading and writing of shared variables.
Memory access can be specified to be UMA, NUMA, EREW, CREW, or CRCW with a defined
conflict policy.
The PRAM model can apply to SIMD class machines if all processors execute identical
instructions on the same cycle or to MIMD class machines if the processors are executing different
instructions.
Load imbalance is the only form of overhead in the PRAM model.
Exclusive Read (ER) — This allows at most one processor to read from any memory location in
each cycle, a rather restrictive policy.
Exclusive Write (EW) — This allows at most one processor to write into a memory location at a
time.
Concurrent Read (CR) — This allows multiple processors to read the same information from the
same memory cell in the same cycle.
Concurrent Write (CW) — This allows simultaneous writes to the same memory location.
In order to avoid confusion, some policy must be set up to resolve the write conflicts. Various
combinations of the above options lead to several variants of the PRAM model as specified below.
PRAM Variants
There are 4 variants of the PRAM model, depending on how the memory reads and writes are handled.
EREW – PRAM Model (Exclusive Read, Exclusive Write): This model forbids more than one
processor from reading or writing the same memory cell simultaneously. This is the most
restrictive PRAM model proposed.
CREW – PRAM Model (Concurrent Read, Exclusive Write): Write conflicts are avoided
by mutual exclusion. Concurrent reads to the same memory location are allowed.
ERCW – PRAM Model – This allows exclusive read or concurrent writes to the same memory
location.
CRCW – PRAM Model (Concurrent Read, Concurrent Write): This model allows either
concurrent reads or concurrent writes to the same memory location.
Data dependence:
The ordering relationship between statements is indicated by the data dependence. Five types of data
dependence are defined below:
1. Flow dependence: A statement S2 is flow-dependent on S1 if an execution path exists from S1 to
S2 and if at least one output (variable assigned) of S1 feeds in as input (operand to be used) to S2.
This is also called a RAW hazard.
2. Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order
and if the output of S2 overlaps the input of S1. This is also called a WAR hazard.
3. Output dependence: Two statements are output-dependent if they produce (write) the same output
variable. This is also called a WAW hazard.
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same
variable is involved but because the same file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation between two statements cannot be determined in
the following situations:
• The subscript of a variable is itself subscripted (indirect addressing)
• The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different coefficients of the loop
variable.
• The subscript is nonlinear in the loop index variable.
Parallel execution of program segments which do not have total data independence can produce non-
deterministic results.
Example: Consider the following fragment of a program:
S1: Load R1, A      / R1 ← Memory(A) /
S2: Add R2, R1      / R2 ← (R1) + (R2) /
S3: Move R1, R3     / R1 ← (R3) /
S4: Store B, R1     / Memory(B) ← (R1) /
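For readers more familiar with high-level code, the following C sketch (hypothetical variables, not part of the original example) mirrors the register fragment above and annotates the dependence types it exhibits.

/* Minimal C analogue of the register fragment above, with dependence notes. */
#include <stdio.h>

int main(void)
{
    int a = 10, r2 = 0, r3 = 5;
    int r1, b;

    r1 = a;          /* S1: writes r1                                      */
    r2 = r2 + r1;    /* S2: reads r1 -> flow (RAW) dependence on S1        */
    r1 = r3;         /* S3: writes r1 -> antidependence (WAR) w.r.t. S2,
                            output dependence (WAW) w.r.t. S1              */
    b  = r1;         /* S4: reads r1 -> flow (RAW) dependence on S3        */

    printf("r2 = %d, b = %d\n", r2, b);
    return 0;
}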
Control Dependence:
This refers to the situation where the order of the execution of statements cannot be determined
before run time.
For example, consider conditional statements, where the statement executed next depends on the outcome of the condition.
Different paths taken after a conditional branch may introduce or eliminate data dependence among
the instructions.
This dependence also exists between operations performed in successive iterations of a loop.
Control dependence often prohibits parallelism from being exploited.
Control-independent example:
for (i=0; i<n; i++)
{
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
Control-dependent example:
for (i=1; i<n; i++)
{
if (a[i-1] < 0) a[i] = 1;
}
Control dependence thus prevents parallelism from being exploited; compiler techniques are used to
eliminate this control dependence and exploit the available parallelism.
Resource dependence:
Data and control dependencies are based on the independence of the work to be done.
Resource dependence is concerned with conflicts in using shared resources, such as registers,
integer and floating-point ALUs, etc. ALU conflicts are called ALU dependence.
Memory (storage) conflicts are called storage dependence.
Bernstein’s Conditions
Bernstein’s conditions are a set of conditions that two processes must satisfy in order to execute in parallel.
Notation
Ii is the set of all input variables for a process Pi. Ii is also called the read set or domain of Pi. Oi is
the set of all output variables for a process Pi . Oi is also called write set.
If P1 and P2 can execute in parallel (written P1 || P2), then:
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅ (see the checking sketch after the example below).
In terms of data dependencies, Bernstein’s conditions imply that two processes can execute in
parallel if they are flow-independent, anti-independent, and output-independent.
The parallelism relation || is commutative (Pi || Pj implies Pj || Pi ), but not transitive (Pi || Pj and Pj
|| Pk does not imply Pi || Pk ) .
Therefore, || is not an equivalence relation. Intersection of the input sets is allowed.
Assume that each statement requires one step to execute. No pipelining is considered here. The
dependence graph shown in 2.2a demonstrates flow dependence as well as resource dependence. In
sequential execution, five steps are needed (Fig. 2.2b).
If two adders are available simultaneously, the parallel execution requires only 3 steps as shown in
Fig 2.2c.
Pairwise, there are 10 pairs of statements to check against Bernstein’s conditions. Only 5 pairs,
P1||P5, P2||P3, P2||P5, P5||P3 and P4||P5 can execute in parallel as revealed in Fig 2.2a if there are
no resource conflicts.
Collectively, only P2||P3||P5 is possible(Fig. 2.2c) because P2||P3, P3||P5 and P5||P2 are all
possible.
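Bernstein’s conditions can be checked mechanically. The C sketch below is a minimal illustration, assuming the read and write sets of each process are encoded as bit masks (one bit per variable); the encoding and the example statements are assumptions, not from the text.

/* Hypothetical sketch: check Bernstein's conditions for two processes whose
   input and output variable sets are encoded as bit masks. */
#include <stdbool.h>
#include <stdio.h>

/* Returns true if P1 || P2, i.e. the two processes may run in parallel. */
static bool bernstein_parallel(unsigned I1, unsigned O1,
                               unsigned I2, unsigned O2)
{
    return ((I1 & O2) == 0) &&   /* flow-independent:   I1 ∩ O2 = empty */
           ((I2 & O1) == 0) &&   /* anti-independent:   I2 ∩ O1 = empty */
           ((O1 & O2) == 0);     /* output-independent: O1 ∩ O2 = empty */
}

int main(void)
{
    /* Bit 0 = A, bit 1 = B, bit 2 = C, bit 3 = D (assumed encoding). */
    unsigned I1 = 0x3, O1 = 0x4;  /* P1: C = A + B                     */
    unsigned I2 = 0x8, O2 = 0x2;  /* P2: B = D * 2  (writes B, read by P1) */
    printf("P1 || P2 ? %s\n",
           bernstein_parallel(I1, O1, I2, O2) ? "yes" : "no");  /* prints "no" */
    return 0;
}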
Software Parallelism
Software parallelism is defined by the control and data dependence of programs and is revealed in the
program’s flow graph; i.e., it is determined by dependencies within the code and is a function of algorithm,
programming style, and compiler optimization.
Consider the example program graph in Fig. 2.3a. There are eight instructions (four loads and four
arithmetic operations) to be executed in three consecutive machine cycles.
Four load operations are performed in the first cycle, followed by two multiply operations in the
second cycle and two add/subtract operations in the third cycle.
Therefore, the parallelism varies from 4 to 2 in three cycles. The average software parallelism is
equal to 8/3 = 2.67 instructions per cycle in this example program.
Now consider execution of the same program by a two-issue processor which can execute one
memory access (load or write) and one arithmetic (add, subtract, multiply, etc.) operation
simultaneously.
With this hardware restriction, the program must execute in seven machine cycles as shown in Fig.
2.3b. Therefore, the hardware parallelism displays an average value of 8/7 = 1.14 instructions
executed per cycle.
This demonstrates a mismatch between the software parallelism and the hardware parallelism.
Let us try to match the software parallelism shown in Fig. 2.3a in a hardware platform of a dual-
processor system, where single-issue processors are used.
The achievable hardware parallelism is shown in Fig 2.4. Six processor cycles are needed to
execute 12 instructions by two processors.
S1 and S2 are two inserted store operations, l5 and l6 are two inserted load operations for
interprocessor communication through the shared memory.
Levels of Parallelism
Loop-level Parallelism
A typical loop has fewer than 500 instructions. If a loop operation is independent between iterations, it
can be handled by a pipeline, or by a SIMD machine.
It is the most optimized program construct to execute on a parallel or vector machine.
Some loops (e.g. recursive) are difficult to handle. Loop-level parallelism is still considered fine
grain computation.
Procedure-level Parallelism
Medium-sized grain; usually less than 2000 instructions.
Detection of parallelism is more difficult than with smaller grains; interprocedural dependence
analysis is difficult and history-sensitive.
Communication requirement is less than at the instruction level. SPMD (single procedure multiple
data) execution is a special case. Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; grain typically has thousands of instructions; medium- or coarse-grain level.
Job steps can overlap across different jobs. Multiprogramming is conducted at this level. No
compilers are currently available to exploit medium- or coarse-grain parallelism.
Communication Latency
Balancing granularity and latency can yield better performance. Various latencies are attributable to
the machine architecture, the implementation technology, and the communication patterns used.
Latency imposes a limiting factor on machine scalability.
Ex: Memory latency increases as memory capacity increases, limiting the amount of memory that can
be used with a given tolerance for communication latency.
Interprocessor Communication Latency
• Needs to be minimized by system designer
• Affected by signal delays and communication patterns. For example, n communicating tasks may
require n(n - 1)/2 communication links; the complexity grows quadratically, effectively limiting the
number of processors in the system.
Communication Patterns
• Determined by algorithms used and architectural support provided
• Patterns include permutation, broadcast, multicast, and conference (many-to-many) communication
• Tradeoffs often exist between granularity of parallelism and communication demand.
There is an obvious tradeoff between the time spent scheduling and synchronizing parallel grains and
the speedup obtained by parallel execution.
One approach to the problem is called "grain packing."
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay
requirements are observed, and no two nodes are executing on the same processor at the same time.
Some general scheduling goals are:
• Schedule all fine-grain activities in a node to the same processor to minimize communication
delays.
• Select grain sizes for packing to achieve better schedules for a particular parallel machine.
With respect to the fine-grain versus coarse-grain program graphs in Fig. 2.6, two multiprocessor
schedules are shown in Fig. 2.7. The fine-grain schedule is longer (42 time units) because more
communication delays were included as shown by the shaded area.
The coarse-grain schedule is shorter (38 time units) because communication delays among nodes
12, 13 and 14 within node D (and also the delays among nodes 15, 16 and 17 within node E)
are eliminated after grain packing.
Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not always
produce a shorter schedule.
By duplicating nodes (that is, executing some instructions on multiple processors), we may
eliminate some interprocessor communication, and thus produce a shorter schedule.
Figure 2.8a shows a schedule without duplicating any of the 5 nodes. This schedule contains idle
time as well as long interprocessor delays (8 units) between P1 and P2.
In Fig 2.8b, node A is duplicated into A’ and assigned to P2 besides retaining the original copy A
in P1.
Similarly, a duplicated node C’ is copied into P1 besides the original node C in P2.
The new schedule shown in Fig. 2.8b is almost 50% shorter than that in Fig. 2.8a. The reduction
in schedule time is caused by elimination of the (a, 8) and (c, 8) delays between the two processors.
Grain packing and node duplication are often used jointly to determine the best grain size and
corresponding schedule.
Four major steps are involved in grain determination and the process of scheduling optimization:
Step 1: Construct a fine-grain program graph
Step 2: Schedule the fine-grain computation
Step 3: Perform grain packing to produce the coarse grains.
Step 4: Generate a parallel schedule based on the packed graph.
• Control flow mechanism: Conventional machines use a control flow mechanism in which the order
of program execution is explicitly stated in the user program.
• Reduction machines trigger an instruction’s execution based on the demand for its results.
Control flow machines used shared memory for instructions and data. Since variables are updated by
many instructions, there may be side effects on other instructions. These side effects frequently prevent
parallel processing. Single processor systems are inherently sequential.
Instructions in dataflow machines are unordered and can be executed as soon as their operands are
available; data is held in the instructions themselves. Data tokens are passed from an instruction to its
dependents to trigger execution.
No need for
• shared memory
• program counter
• control sequencer
A Dataflow Architecture
• The Arvind machine (MIT) has N PEs and an N-by-N interconnection network.
• Each PE has a token-matching mechanism that dispatches only instructions with data tokens
available.
• Each datum is tagged with
o address of instruction to which it belongs
o context in which the instruction is being executed
• Tagged tokens enter PE through local path (pipelined), and can also be communicated to other PEs
through the routing network.
• Instruction address(es) effectively replace the program counter in a control flow machine.
• Context identifier effectively replaces the frame base register in a control flow machine.
• Since the dataflow machine matches the data tags from one instruction with successors,
synchronized instruction execution is implicit.
• An I-structure in each PE is provided to eliminate excessive copying of data structures.
• Each word of the I-structure has a two-bit tag indicating whether the value is empty, full or has
pending read requests.
• This is a retreat from the pure dataflow approach.
• Special compiler technology needed for dataflow machines.
Demand-Driven Mechanisms
• Graph-reduction model:
o expression graph reduced by evaluation of branches or subgraphs, possibly in parallel,
with demanders given pointers to results of reductions.
o based on sharing of pointers to arguments; traversal and reversal of pointers continues
until constant arguments are encountered.
Various types of interconnection networks have been suggested for SIMD computers. Based on
network topology, they are classified into two categories:
1. Static Networks
2. Dynamic Networks
The topology of an interconnection network can be either static or dynamic. Static networks are
formed of point-to-point direct connections which will not change during program execution.
Dynamic networks are implemented with switched channels, which are dynamically configured to
match the communication demand in user programs.
Packet switching and routing play an important role in modern multiprocessor architecture.
Bisection Width:
When a given network is cut into two equal halves, the minimum number of edges (channels) along the
cut is called the bisection width b. In the case of a communication network, each edge may correspond
to a channel of w bits (i.e., w wires). For example, a 2D mesh of √N × √N nodes has b = √N, while a
binary hypercube with N nodes has b = N/2.
To summarize the above discussions, the performance of an interconnection network is affected by the
following factors:
Functionality: This refers to how the network supports data routing, interrupt handling, synchronization,
request/message combining, and coherence.
Network latency: This refers to the worst-case time delay for a unit message to be transferred
through the network.
Bandwidth: This refers to the maximum data transfer rate, in terms of Mbps or Gbps transmitted
through the network.
Hardware complexity: This refers to implementation costs such as those for wires, switches,
connectors, arbitration, and interface logic.
Scalability: This refers to the ability of a network to be modularly expandable with scalable
performance as machine resources increase.
Digital Buses
• A bus system (contention bus, time-sharing bus) has
o a collection of wires and connectors
o multiple modules (processors, memories, peripherals, etc.) which connect to the wires
o data transactions between pairs of modules
MODULE-2
Processors can be "mapped" to a space that has clock rate and cycles per instruction (CPI) as
coordinates. Each processor type occupies a region of this space.
Newer technologies are enabling higher clock rates.
Manufacturers are also trying to lower the number of cycles per instruction.
Thus the "future processor space" is moving toward the lower right of the processor design space.
• Complex Instruction Set Computing (CISC) processors like the Intel 80486, the Motorola 68040,
the VAX/8600, and the IBM S/390 typically use microprogrammed control units, have lower clock
rates, and higher CPI figures than…
• Reduced Instruction Set Computing (RISC) processors like the Intel i860, SPARC, MIPS R3000,
and IBM RS/6000, which have hard-wired control units, higher clock rates, and lower CPI figures.
Superscalar Processors
• This subclass of the RISC processors allow multiple instructions to be issued simultaneously
during each cycle.
• The effective CPI of a superscalar processor should be less than that of a generic scalar RISC
processor.
• Clock rates of scalar RISC and superscalar RISC machines are similar.
VLIW Machines
• Very Long Instruction Word machines typically have many more functional units than superscalars
(and thus the need for longer – 256 to 1024 bits – instructions to provide control for them).
• These machines mostly use microprogrammed control units with relatively slow clock rates
because of the need to use ROM to hold the microcode.
Superpipelined Processors
• These processors typically use a multiphase clock (actually several clocks that are out of phase
with each other, each phase perhaps controlling the issue of another instruction) running at a
relatively high rate.
• The CPI in these machines tends to be relatively high (unless multiple instruction issue is used).
• Processors in vector supercomputers are mostly superpipelined and use multiple functional units
for concurrent scalar and vector operations.
Instruction Pipelines
• Instruction pipeline cycle – the time required for each phase to complete its operation
(assuming equal delay in all phases)
• Instruction issue latency – the time (in cycles) required between the issuing of two adjacent
instructions
• Instruction issue rate – the number of instructions issued per cycle (the degree of a
superscalar)
• Simple operation latency – the delay (after the previous instruction) associated with the
completion of a simple operation (e.g. integer add) as compared with that of a complex
operation (e.g. divide).
• Resource conflicts – when two or more instructions demand use of the same functional unit(s)
at the same time.
Pipelined Processors
– can be fully utilized if instructions can enter the pipeline at a rate of one per cycle
• CPI rating is 1 for an ideal pipeline. Underpipelined systems will have higher CPI ratings,
lower clock rates, or both.
• Figure 4.3 shows the data path architecture and control unit of a typical, simple scalar processor
which does not employ an instruction pipeline. Main memory, I/O controllers, etc. are connected to
the external bus.
• The control unit generates control signals required for the fetch, decode, ALU operation, memory
access, and write result phases of instruction execution.
• The control unit itself may employ hardwired logic, or—as was more common in older CISC style
processors—microcoded logic.
• Modern RISC processors employ hardwired logic, and even modern CISC processors make use of
many of the techniques originally developed for high-performance RISC processors.
• CISC
– Many different instructions
– Many different operand data types
– Many different operand addressing formats
– Relatively small number of general purpose registers
– Many instructions directly match high-level language constructions
• RISC
– Many fewer instructions than CISC (freeing chip space for more functional units!)
– Fixed instruction format (e.g. 32 bits) and simple operand addressing
– Relatively large number of registers
– Small CPI (close to 1) and high clock rates
Architectural Distinctions
• CISC
– Unified cache for instructions and data (in most cases)
– Microprogrammed control units and ROM in earlier processors (hard-wired controls
units now in some CISC systems)
• RISC
– Separate instruction and data caches
– Hard-wired control units
• CISC Advantages
– Smaller program size (fewer instructions)
– Simpler control unit design
– Simpler compiler design
• RISC Advantages
– Has potential to be faster
– Many more registers
• RISC Problems
– More complicated register decoding system
– Hardwired control is less flexible than microcode
– VAX 8600
– Motorola MC68040
– Intel Pentium
• RISC and CISC scalar processors should have same performance if clock rate and program
lengths are equal.
• RISC moves less frequent operations into software, thus dedicating hardware resources to the
most frequently used operations.
– Sun SPARC
– Intel i860
– Motorola M88100
– AMD 29000
• SPARC family chips produced by Cypress Semiconductors, Inc. Figure 4.7 shows the architecture
of the Cypress CY7C601 SPARC processor and of the CY7C602 FPU.
• The Sun SPARC instruction set contains 69 basic instructions
• The SPARC runs each procedure with a set of thirty-two 32-bit IU registers.
• Eight of these registers are global registers shared by all procedures, and the remaining 24 are
window registers associated with the current procedure.
• The concept of using overlapped register windows is the most important feature introduced by the
Berkeley RISC architecture.
• Fig. 4.8 shows eight overlapping windows (formed with 64 local registers and 64 overlapped
registers) and eight globals with a total of 136 registers, as implemented in the Cypress 601.
• Each register window is divided into three eight-register sections, labeled Ins, Locals, and Outs.
• The local registers are only locally addressable by each procedure. The Ins and Outs are shared
among procedures.
• The calling procedure passes parameters to the called procedure via its Outs (r8 to r15) registers,
which are the Ins registers of the called procedure.
• The window of the currently running procedure is called the active window pointed to by a current
window pointer.
• A window invalid mask is used to indicate which window is invalid. The trap base register serves
as a pointer to a trap handler.
• A special register is used to create a 64-bit product in multiple step instructions. Procedures can
also be called without changing the window.
• The overlapping windows can significantly save the time required for interprocedure
communications, resulting in much faster context switching among cooperative procedures.
A CISC or a RISC scalar processor can be improved with a superscalar or vector architecture.
Scalar processors are those executing one instruction per cycle.
Only one instruction is issued per cycle, and only one completion of instruction is expected from
the pipeline per cycle.
In a superscalar processor, multiple instructions are issued per cycle and multiple results are
generated per cycle.
A vector processor executes vector instructions on arrays of data; each vector instruction involves a
string of repeated operations, which are ideal for pipelining with one result per cycle.
Superscalar processors are designed to exploit more instruction-level parallelism in user programs.
Only independent instructions can be executed in parallel without causing a wait state. The amount
of instruction level parallelism varies widely depending on the type of code being executed.
It has been observed that the average value is around 2 for code without loop unrolling. Therefore,
for these codes there is not much benefit gained from building a machine that can issue more than
three instructions per cycle.
The instruction-issue degree in a superscalar processor has thus been limited to 2 to 5 in practice.
• Code density in VLIW is less than in superscalars, because if a "region" of a VLIW word isn’t
needed in a particular instruction, it must still exist (to be filled with a "no-op").
• Superscalars can be compatible with scalar processors; this is difficult with VLIW parallel and
non-parallel architectures.
VLIW Opportunities
• The efficiency of the machine is entirely dictated by the success, or "goodness," of the
compiler in planning the operations to be placed in the same instruction words.
• Different implementations of the same VLIW architecture may not be binary-compatible with
each other if they have different latencies.
VLIW Summary
• VLIW reduces the effort required to detect parallelism using hardware or software techniques.
• The main advantage of VLIW architecture is its simplicity in hardware structure and instruction
set.
• Unfortunately, VLIW does require careful analysis of code in order to "compact" the most
appropriate "short" instructions into a VLIW word.
• Typical memory-to-memory vector instructions operate on vector operands fetched directly from
memory, with the results stored back to memory.
• Vector processors can usually use large pipelines effectively in parallel; the number of such
parallel pipelines is limited by the number of functional units.
• As usual, the effectiveness of a pipelined system depends on the availability and use of an
effective compiler to generate code that makes good use of the pipeline facilities.
Symbolic Processors
• Symbolic processors are somewhat unique in that their architectures are tailored toward the
execution of programs in languages similar to LISP, Scheme, and Prolog.
• In effect, the hardware provides a facility for the manipulation of the relevant data objects with
―tailored‖ instructions.
• These processors (and programs of these types) may invalidate assumptions made about more
traditional scientific and business computations.
Storage devices such as registers, caches, main memory, disk devices, and backup storage are often
organized as a hierarchy as depicted in Fig. 4.17.
The memory technology and storage organization at each level are characterized by five parameters:
the access time (ti), memory size (si), cost per byte (ci), transfer bandwidth (bi), and unit of transfer (xi).
Memory devices at levels closer to the CPU are faster to access, smaller in capacity, more expensive per
byte, have a higher bandwidth, and use a smaller unit of transfer.
In general, t(i-1) < t(i), s(i-1) < s(i), c(i-1) > c(i), b(i-1) > b(i) and x(i-1) < x(i) for i = 1, 2, 3, and 4 in the
hierarchy, where i = 0 corresponds to the CPU register level.
The cache is at level 1, main memory at level 2, the disks at level 3 and backup storage at level 4.
Caches
The cache is controlled by the MMU and is programmer-transparent.
The cache can also be implemented at one or multiple levels, depending on the speed and
application requirements.
Multi-level caches are built either on the processor chip or on the processor board.
Multi-level cache systems have become essential to deal with memory access latency.
Peripheral Technology
Peripheral devices include printers, plotters, terminals, monitors, graphics displays, optical
scanners, image digitizers, output microfilm devices etc.
Some I/O devices are tied to special-purpose or multimedia applications.
Information stored in a memory hierarchy (M1, M2,…, Mn) satisfies 3 important properties:
1. Inclusion
2. Coherence
3. Locality
We consider cache memory the innermost level M1, which directly communicates with the CPU
registers.
The outermost level Mn contains all the information words stored. In fact, the collection of all
addressable words in Mn forms the virtual address space of a computer.
Program and data locality is characterized below as the foundation for using a memory hierarchy
effectively.
• The inverse, however, is not necessarily true. That is, the presence of a data item in level Mi+1 does
not imply its presence in level Mi. We call a reference to a missing item a "miss."
The requirement that copies of data items at successive memory levels be consistent is called the
"coherence property."
Coherence Strategies
• Write-through
– A write to a data item in Mi is immediately propagated to the corresponding item in Mi+1,
so the two levels stay consistent at every write.
• Write-back
– The update of the data item in Mi+1 corresponding to a modified item in Mi is not
performed until it (or the block/page/etc. in Mi that contains it) is replaced or removed.
– This is the most efficient approach, but cannot be used (without modification) when
multiple processors share Mi+1, …, Mn.
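A minimal C sketch contrasting the two strategies is given below; the single-word "line" structure and the function names are illustrative assumptions, not an actual cache design.

/* Hypothetical single-word illustration of write-through vs. write-back
   between a faster level Mi and a slower level Mi+1. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int value; bool dirty; } line_t;

static int next_level;                  /* stands in for Mi+1             */

/* Write-through: Mi+1 is updated on every write to Mi. */
static void write_through(line_t *l, int v)
{
    l->value   = v;
    next_level = v;                     /* immediate propagation          */
}

/* Write-back: Mi+1 is updated only when the dirty line is evicted. */
static void write_back(line_t *l, int v) { l->value = v; l->dirty = true; }
static void evict(line_t *l)
{
    if (l->dirty) { next_level = l->value; l->dirty = false; }
}

int main(void)
{
    line_t l = {0, false};
    write_through(&l, 1);
    write_back(&l, 2);                  /* next_level still holds 1 here  */
    printf("before evict: next_level = %d\n", next_level);
    evict(&l);                          /* now next_level becomes 2       */
    printf("after  evict: next_level = %d\n", next_level);
    return 0;
}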
3. Locality of References
• Memory references are generated by the CPU for either instruction or data access.
• These accesses tend to be clustered in certain regions in time, space, and ordering.
There are three dimensions of the locality property:
– Temporal locality – recently referenced items are likely to be referenced again in the near future.
– Spatial locality – if location M is referenced at time t, then a nearby location M ± m is likely to
be referenced at time t + Δt.
– Sequential locality – instructions tend to be accessed in sequential (program) order.
Working Sets
• The set of addresses (bytes, pages, etc.) referenced by a program during the interval from t to
t + Δt, where Δt is called the working set parameter, changes slowly.
• This set of addresses, called the working set, should be present in the higher levels of M if a
program is to execute efficiently (that is, without requiring numerous movements of data items
from lower levels of M). This is called the working set principle.
Hit Ratios
• When a needed item (instruction or data) is found in the level of the memory hierarchy being
examined, it is called a hit. Otherwise (when it is not found), it is called a miss (and the item
must be obtained from a lower level in the hierarchy).
• The hit ratio, h, for Mi is the probability (between 0 and 1) that a needed data item is found
when sought in level memory Mi.
• We assume h0 = 0 and hn = 1.
Access Frequencies
• There are different penalties associated with misses at different levels in the memory hierarchy.
Hierarchy Optimization
This implies that the cost is distributed over n levels. Since c1 > c2 > c3 > … > cn, we have to choose
s1 < s2 < s3 < … < sn.
The optimal design of a memory hierarchy should result in a Teff close to the t1 of M1 and a total
cost close to the cost of Mn.
The optimization process can be formulated as a linear programming problem, given a ceiling C0 on
the total cost; that is, the problem is to minimize the effective access time Teff subject to the constraint
Ctotal = Σ ci · si < C0.
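A small C sketch of these quantities is shown below, assuming the usual definitions of the access frequency fi = hi(1 - h1)...(1 - h(i-1)), the effective access time Teff = Σ fi·ti, and the total cost Ctotal = Σ ci·si; the three-level numbers used are illustrative only.

/* Sketch of effective access time and total cost for a 3-level hierarchy.
   Assumed definitions: fi = hi*(1-h1)*...*(1-h(i-1)),  Teff = sum fi*ti,
   Ctotal = sum ci*si.  All numbers are illustrative, not from the text. */
#include <stdio.h>

int main(void)
{
    /* level:        cache     main memory  disk                          */
    double h[3] = { 0.95,     0.99,        1.0    };  /* hit ratios (hn=1) */
    double t[3] = { 20e-9,    200e-9,      5e-3   };  /* access times (s)  */
    double c[3] = { 1.0e-4,   1.0e-6,      1.0e-9 };  /* cost per byte ($) */
    double s[3] = { 512e3,    64e6,        2e9    };  /* capacities (bytes)*/

    double teff = 0.0, ctotal = 0.0, miss_so_far = 1.0;
    for (int i = 0; i < 3; i++) {
        double fi = miss_so_far * h[i];   /* access frequency of level i   */
        teff   += fi * t[i];
        ctotal += c[i] * s[i];
        miss_so_far *= (1.0 - h[i]);
    }
    printf("Teff = %.3e s, Ctotal = $%.2f\n", teff, ctotal);
    return 0;
}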
• To facilitate the use of memory hierarchies, the memory addresses normally generated by
modern processors executing application programs are not physical addresses, but are rather
virtual addresses of data items and instructions.
• Physical addresses, of course, are used to reference the available locations in the real physical
memory of a system.
• Virtual addresses must be mapped to physical addresses before they can be used.
• The mapping from virtual to physical addresses can be formally defined as follows:
• The mapping returns a physical address if a memory hit occurs. If there is a memory miss, the
referenced item has not yet been brought into primary memory.
Mapping Efficiency
• The efficiency with which the virtual to physical mapping can be accomplished significantly
affects the performance of the system.
– In this scheme, each processor has a separate virtual address space, but all processors share the
same physical address space.
– Advantages:
– Disadvantages
• The same virtual address in different virtual spaces may point to different pages in
physical memory
• All processors share a single shared virtual address space, with each processor being given a
portion of it.
Advantages:
Disadvantages
• Processors must be capable of generating large virtual addresses (usually > 32 bits)
• Since the page table is shared, mutual exclusion must be used to guarantee atomic updates
• Segmentation must be used to confine each process to its own address space
• The address translation process is slower than with private (per processor) virtual memory
Memory Allocation
Both the virtual address space and the physical address space are divided into fixed-length pages.
• The purpose of memory allocation is to allocate pages of virtual memory to the page frames of
physical memory.
The process demands the translation of virtual addresses into physical addresses. Various schemes
for virtual address translation are summarized in Fig. 4.21a.
The translation demands the use of translation maps which can be implemented in various ways.
Translation maps are stored in the cache, in associative memory, or in the main memory.
To access these maps, a mapping function is applied to the virtual address. This function generates
a pointer to the desired translation map.
This mapping can be implemented with a hashing or congruence function.
Hashing is a simple technique for converting a long page number into a short one with fewer bits.
fewer bits.
The hashing function should randomize the virtual page number and produce a unique hashed
number to be used as the pointer.
Our purpose is to produce the physical address consisting of the page frame number, the block
number, and the word address.
The first step of the translation is to use the virtual page number as a key to search through the TLB
for a match.
The TLB can be implemented with a special associative memory (content addressable memory) or
use part of the cache memory.
In case of a match (a hit) in the TLB, the page frame number is retrieved from the matched page
entry. The cache block and word address are copied directly.
In case the match cannot be found (a miss) in the TLB, a hashed pointer is used to identify one of
the page tables where the desired page frame number can be retrieved.
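The translation flow just described (probe the TLB first, fall back to a page table on a miss) can be sketched in C as follows; the direct-mapped TLB organization, the 4 KB page size, and the helper names are assumptions made for illustration.

/* Hypothetical sketch of virtual-to-physical translation with a small
   direct-mapped TLB and a page-table fallback.  Assumes 4 KB pages. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS   12
#define TLB_ENTRIES 64

typedef struct { uint32_t vpn; uint32_t pfn; int valid; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static uint32_t page_table[1024];      /* vpn -> pfn, toy-sized           */

static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    uint32_t slot   = vpn % TLB_ENTRIES;          /* simple hash/index     */

    if (tlb[slot].valid && tlb[slot].vpn == vpn)  /* TLB hit               */
        return (tlb[slot].pfn << PAGE_BITS) | offset;

    uint32_t pfn = page_table[vpn];               /* TLB miss: walk table  */
    tlb[slot] = (tlb_entry_t){ vpn, pfn, 1 };     /* refill the TLB        */
    return (pfn << PAGE_BITS) | offset;
}

int main(void)
{
    page_table[3] = 7;                            /* map virtual page 3    */
    printf("0x%x -> 0x%x\n", 0x3ABCu, translate(0x3ABCu));
    return 0;
}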
1. Paging
2. Segmentation
1. Paging memory
• Main memory contains some number of pages which is smaller than the number of pages in the
virtual memory.
• For example, if the page size is 2K, the physical memory is 16M (8K page frames) and the virtual
memory is 4G (2M pages), then there is a 256-to-1 mapping of virtual pages to page frames
(checked in the sketch below).
• A page map table is used for implementing a mapping, with one entry per virtual page.
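A quick check of the figures quoted above (page size 2K, 16M physical memory, 4G virtual memory) is shown in this short sketch.

/* Verify the mapping ratio quoted above: 2 KB pages, 16 MB physical,
   4 GB virtual memory. */
#include <stdio.h>

int main(void)
{
    unsigned long long page     = 2ull  << 10;    /* 2 KB               */
    unsigned long long physical = 16ull << 20;    /* 16 MB              */
    unsigned long long virtual_ = 4ull  << 30;    /* 4 GB               */

    unsigned long long frames = physical / page;  /* 8 K page frames    */
    unsigned long long pages  = virtual_ / page;  /* 2 M virtual pages  */

    printf("%llu frames, %llu pages, ratio %llu:1\n",
           frames, pages, pages / frames);        /* prints 256:1       */
    return 0;
}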
2. Segmented memory
• In a segmented memory management system the blocks to be replaced in main memory are
potentially of unequal length and here the segments correspond to logical blocks of code or data.
• Segments, then, are "atomic" in the sense that either the whole segment should be in main
memory, or none of the segment should be there.
• The segments may be placed anywhere in main memory, but the instructions or data in one
segment must be contiguous.
3. Paged Segmentation
• Within each segment, the addresses are divided into fixed size pages
– Segment Number
– Page Number
– Offset
Inverted paging
• Besides direct mapping, address translation maps can also be implemented with inverted mapping
(Fig. 4.21c).
• An inverted page table is created for each page frame that has been allocated to users. Any virtual
page number can be paired with a given physical page number.
• Inverted page tables are accessed either by an associative search or by the use of a hashing
function.
• In using an inverted PT, only virtual pages that are currently resident in physical memory are
included. This provides a significant reduction in the size of the page tables.
• The generation of a long virtual address from a short physical address is done with the help of
segment registers, as demonstrated in Fig. 4.21c.
• The leading 4 bits (denoted sreg) of a 32-bit address name a segment register.
• The register provides a segment id that replaces the 4-bit sreg to form a long virtual address.
• This effectively creates a single long virtual address space with segment boundaries at multiples of
256 Mbytes (2^28 bytes).
• The IBM RT/PC had a 12-bit segment id (4096 segments) and a 40-bit virtual address space.
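The formation of the long virtual address from the 32-bit short address described above can be sketched as follows; the helper names are assumptions, and the 12-bit segment ids follow the RT/PC-style scheme mentioned in the text.

/* Sketch of forming a long virtual address from a 32-bit effective address
   using 16 segment registers holding 12-bit segment ids. */
#include <stdint.h>
#include <stdio.h>

static uint16_t seg_reg[16];          /* each holds a 12-bit segment id   */

static uint64_t long_vaddr(uint32_t short_va)
{
    uint32_t sreg   = short_va >> 28;             /* leading 4 bits       */
    uint32_t offset = short_va & 0x0FFFFFFFu;     /* 28-bit segment offset*/
    uint64_t segid  = seg_reg[sreg] & 0xFFFu;     /* 12-bit segment id    */
    return (segid << 28) | offset;                /* 40-bit virtual addr  */
}

int main(void)
{
    seg_reg[2] = 0x5A3;                           /* assumed segment id   */
    printf("0x%llx\n", (unsigned long long)long_vaddr(0x20010000u));
    return 0;
}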
• Either associative page tables or inverted page tables can be used to implement inverted mapping.
• The inverted page table can also be assisted with the use of a TLB. An inverted PT avoids the use
of a large page table or a sequence of page tables.
• Given a virtual address to be translated, the hardware searches the inverted PT for that address and,
if it is found, uses the table index of the matching entry as the address of the desired page frame.
• A hashing table is used to search through the inverted PT.
• The size of an inverted PT is governed by the size of the physical space, while that of traditional
PTs is determined by the size of the virtual space.
• Because of limited physical space, no multiple levels are needed for the inverted page table.
• Memory management policies include the allocation and deallocation of memory pages to active
processes and the replacement of memory pages.
• In demand paging memory systems, page replacement refers to the process in which a resident page
in main memory is replaced by a new page transferred from the disk.
• Since the number of available page frames is much smaller than the number of pages, the frames
will eventually be fully occupied.
• In order to accommodate a new page, one of the resident pages must be replaced.
• The goal of a page replacement policy is to minimize the number of possible page faults so that the
effective memory-access time can be reduced.
• The effectiveness of a replacement algorithm depends on the program behavior and memory traffic
patterns encountered.
• A good policy should match the program locality property. The policy is also affected by page size
and by the number of available frames.
Page Traces: A page trace is a sequence of page frame numbers (PFNs) generated during the
execution of a given program.
The following page replacement policies are specified in a demand paging memory system for a page
fault at time t.
(1) Least recently used (LRU)—This policy replaces the page in R(t) which has the longest backward
distance.
(2) Optimal (OPT) algorithm—This policy replaces the page in R(t) with the longest forward
distance.
(3) First-in-first-out (FIFO)—This policy replaces the page in R(t) which has been in memory for the
longest time.
(4) Least frequently used (LFU)—This policy replaces the page in R(t) which has been least
referenced in the past.
(5) Circular FIFO—This policy joins all the page frame entries into a circular FIFO queue using a
pointer to indicate the front of the queue.
• An allocation bit is associated with each page frame. This bit is set upon initial allocation of a page
to the frame.
• When a page fault occurs, the queue is circularly scanned from the pointer position.
• The pointer skips the allocated page frames and replaces the very first unallocated page frame.
• When all frames are allocated, the front of the queue is replaced, as in the FIFO policy.
(6) Random replacement—This is a trivial algorithm which chooses any page for replacement
randomly.
Example:
Consider a paged virtual memory system with a two-level hierarchy: main memory M1 and disk
memory M2.
Assume a page size of four words. The number of page frames in M1 is 3, labeled a, b and c; and the
number of pages in M2 is 10, identified by 0, 1, 2, …, 9. The ith page in M2 consists of word
addresses 4i to 4i + 3 for all i = 0, 1, 2, …, 9.
A certain program generates the following sequence of word addresses which are grouped (underlined)
together if they belong to the same page. The sequence of page numbers so formed is the page trace:
Page tracing experiments are described below for three page replacement policies: LRU, OPT, and
FIFO, respectively. The successive pages loaded in the page frames (PFs) form the trace entries.
Initially, all PFs are empty.
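As an illustration of such a page-trace experiment, the C sketch below simulates the FIFO policy on an assumed page trace with three page frames and counts the resulting page faults (the trace is illustrative, not the one used in the figures).

/* FIFO page-replacement simulation over an assumed page trace,
   with 3 page frames, counting page faults. */
#include <stdio.h>

#define NFRAMES 3

int main(void)
{
    int trace[] = { 0, 1, 2, 4, 2, 3, 7, 2, 1, 0, 3 };   /* assumed trace */
    int n       = sizeof trace / sizeof trace[0];
    int frame[NFRAMES] = { -1, -1, -1 };
    int next = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < NFRAMES; j++)
            if (frame[j] == trace[i]) { hit = 1; break; }
        if (!hit) {                    /* page fault: replace oldest frame */
            frame[next] = trace[i];
            next = (next + 1) % NFRAMES;
            faults++;
        }
    }
    printf("page faults = %d out of %d references\n", faults, n);
    return 0;
}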
Module-III
A backplane bus interconnects processors, data storage, and peripheral devices in a tightly coupled
hardware configuration.
The system bus must be designed to allow communication between devices on the bus without
disturbing the internal activities of all the devices attached to the bus.
Timing protocols must be established to arbitrate among multiple requests. Operational rules must
be set to ensure orderly data transfers on the bus.
Signal lines on the backplane are often functionally grouped into several buses as shown in Fig 5.1.
Various functional boards are plugged into slots on the backplane. Each slot is provided with one
or more connectors for inserting the boards as demonstrated by the vertical arrows.
• Data address and control lines form the data transfer bus (DTB) in VME bus.
• Address lines are used to broadcast the data and device address; their number is proportional to the logarithm of the address space size.
• The number of data lines is proportional to the memory word length.
• Control lines specify read/write, timing, and bus error conditions
The process of assigning control of the DTB to a requester is called arbitration. Dedicated lines are
reserved to coordinate the arbitration process among several requesters.
The requester is called a master, and the receiving end is called a slave.
Interrupt lines are used to handle interrupts, which are often prioritized. Dedicated lines may be
used to synchronize parallel activities among the processor modules.
Utility lines include signals that provide periodic timing (clocking) and coordinate the power-up
and power-down sequences of the system.
The backplane is made of signal lines and connectors.
A special bus controller board is used to house the backplane control logic, such as the system
clock driver, arbiter, bus timer, and power driver.
Functional Modules
A functional module is a collection of electronic circuitry that resides on one functional board (Fig.
5.1) and works to achieve special bus control functions.
Special functional modules are introduced below:
• Arbiter is a functional module that accepts bus requests from the requester module and grants
control of the DTB to one requester at a time.
• Bus timer measures the time each data transfer takes on the DTB and terminates the DTB cycle if
a transfer takes too long.
• Interrupter module generates an interrupt request and provides status/ID information when an
interrupt handler module requests it.
• Location monitor is a functional module that monitors data transfers over the DTB. A power
monitor watches the status of the power source and signals when power becomes unstable.
• System clock driver is a module that provides a clock timing signal on the utility bus. In addition,
board interface logic is needed to match the signal line impedance, the propagation time, and
termination values between the backplane and the plug-in boards.
Physical Limitations
• Due to electrical, mechanical, and packaging limitations, only a limited number of boards can be
plugged into a single backplane.
• Multiple backplane buses can be mounted on the same backplane chassis.
• The bus system is difficult to scale, mainly limited by packaging constraints.
• Two types of printed circuit boards connected to a bus: active and passive
• Active devices like processors can act as bus masters or as slaves at different times.
• Passive devices like memories can act only as slaves.
• The master can initiate a bus cycle
– Only one can be in control at a time
• The slaves respond to requests by a master
– Multiple slaves can respond
Bus Addressing
• The backplane bus is driven by a digital clock with a fixed cycle time, called the bus cycle.
• Because the backplane has a limited physical size, information skew among the signal lines is negligible.
Synchronous Timing
• All bus transaction steps take place at fixed clock edges as shown in Fig. 5.3a.
• The clock signals are broadcast to all potential masters and slaves.
• Clock cycle time determined by slowest device on bus
• Once the data becomes stabilized on the data lines, the master uses Data-ready pulse to initiate
the transfer
• The Slave uses Data-accept pulse to signal completion of the information transfer.
• Synchronous timing is simple, requires less interface circuitry, and is suitable for connecting devices with roughly the same speed.
Asynchronous Timing
• Asynchronous timing is based on a handshaking (interlocking) exchange of ready/acknowledge signals between master and slave, as shown in Fig. 5.3b; no fixed clock cycle is needed.
• It can accommodate devices of widely different speeds on the same bus, at the cost of more complex interface logic.
Arbitration
Daisy-chained arbitration:
• Simple scheme
• Easy to add devices
• Fixed-priority sequence – not fair
• Propagation of the bus-grant signal is slow
• Not fault tolerant
Independent requests and grants:
• Provide independent bus-request and grant signals for each master, as shown in Fig. 5.5a.
• No daisy chaining is used in this scheme.
• Require a central arbiter, but can use a priority or fairness based policy
• More flexible and faster than a daisy-chained policy
• Larger number of lines – costly
Distributed Arbitration
• Each master has its own arbiter and unique arbitration number as shown in Fig. 5.5b.
• Uses arbitration number to resolve arbitration competition
• When two or more devices compete for the bus, the winner is the one whose arbitration number is the largest, as determined by parallel contention arbitration.
• All potential masters send their arbitration numbers to the shared-bus request/grant (SBRG) lines, and each compares its own number with the number that appears on the SBRG lines.
• If the SBRG number is greater, the requester is dismissed. At the end, the winner’s arbitration
number remains on the arbitration bus. After the current bus transaction is completed, the winner
seizes control of the bus.
• Priority based scheme
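The following sketch models the parallel contention (self-selection) step described above: the SBRG lines are treated as a wired-OR evaluated bit by bit from the most significant position, and requesters that see a larger competitor withdraw. The 4-bit width and the arbitration numbers are made-up values used only for illustration.

# A sketch of parallel contention (self-selection) arbitration over shared
# SBRG lines, modeled as a wired-OR evaluated bit by bit from the MSB down.

def arbitrate(arbitration_numbers, width=4):
    competing = set(arbitration_numbers)
    for bit in reversed(range(width)):                  # MSB first
        line = any((n >> bit) & 1 for n in competing)   # wired-OR of driven bits
        if line:
            # masters driving 0 on this line see a larger competitor and withdraw
            competing = {n for n in competing if (n >> bit) & 1}
    return max(competing)   # the surviving (largest) arbitration number wins

print(arbitrate([0b0101, 0b1001, 0b0110]))   # -> 9, the largest requester wins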
Transfer Modes
Interrupt Mechanisms
• Interrupt: is a request from I/O or other devices to a processor for service or attention
• A priority interrupt bus is used to pass the interrupt signals
• Interrupter must provide status and identification information
• Have an interrupt handler for each request line
• Interrupts can also be handled by message passing on the data lines on a time-sharing basis.
– This saves interrupt lines but consumes data bus cycles.
– The use of time-shared data bus lines for this purpose is called a virtual interrupt.
Standard Requirements
The major objectives of the Futurebus+ standards committee were to create a bus standard that would
provide a significant step forward in improving the facilities and performance available to the
designers of multiprocessor systems.
Below are the design requirements set by the IEEE 896.1-1991 Standards Committee to provide a
stable platform on which several generations of computer systems could be based:
• Independence for an open standard
• Asynchronous timing protocol
• Optional packet protocol
• Distributed arbitration protocols
• Support of high reliability and fault tolerant applications
• Ability to lock modules without deadlock or livelock
• Circuit-switched and split transaction protocols
• Support of real-time mission critical computations w/multiple priority levels
• 32 or 64 bit addressing
• Direct support of snoopy cache-based multiprocessors.
• Compatible message passing protocols
• Most multiprocessor systems use private caches for each processor as shown in Fig. 5.6
• Have an interconnection network between caches and main memory
• Caches can be addressed using either a Physical Address or Virtual Address.
• Two different cache design models are:
– Physical address cache
– Virtual address cache
Physical address cache: the cache is indexed and tagged with physical addresses, so a cache lookup can complete only after the MMU/TLB has translated the virtual address.
Advantages:
• No cache flushing on a context switch
• No aliasing problem thus fewer cache bugs in OS kernel.
• Simple design
• Requires little intervention from OS kernel
Disadvantages:
• Slowdown in accessing the cache until the MMU/TLB finishes translating the address
• When a cache is indexed or tagged with virtual address it is called virtual address cache.
• In this model, the cache lookup and the MMU translation (or validation) are done in parallel.
• The physical address generated by the MMU can be saved in tags for later write back but is not
used during the cache lookup operations.
Advantages:
• do address translation only on a cache miss
• faster for hits because no address translation
• More efficient access to cache
Disadvantages:
• Cache flushing is required on a context switch (for example, without flushing, entries cached for the previous virtual address space could produce erroneous hits after the address space changes).
• Aliasing problem (several different virtual addresses may map to the same physical address; such synonyms cannot coexist in the cache without being duplicated).
Direct-mapped cache: the block field (r bits) of the address is used to implement the (modulo-m) placement, where m = 2^r is the number of block frames in the cache.
Once the block Bi is uniquely identified by this field, the tag associated with the addressed block is compared with the tag in the memory address.
• Advantages
– Simple hardware
– No associative search
– No page replacement policy
– Lower cost
– Higher speed
• Disadvantages
– Rigid mapping
– Poorer hit ratio
– Prohibits parallel virtual address translation
– Use larger cache size with more block frames to avoid contention
Fully associative mapping
• Advantages:
– Offers most flexibility in mapping cache blocks
– Higher hit ratio
– Allows better block replacement policy with reduced block contention
• Disadvantages:
– Higher hardware cost
– Only moderate size cache
– Expensive search process
Set-associative mapping
• Compare the tag with the k tags within the identified set, as shown in Fig. 5.11a.
• Since k is rather small in practice, the k-way associative search is much more economical than
the full associativity.
• In general, a block Bj can be mapped into any one of the available frames Bf in a set Si, defined below, where v is the number of sets:
Bj → Bf ∈ Si if j (mod v) = i
• The matched tag identifies the current block which resides in the frame.
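As a concrete illustration of the mapping rule above, the sketch below splits an address into tag, set index and word offset for a k-way set-associative cache; the geometry (64-byte blocks, 256 sets, 4-way) is an assumed example, and direct mapping is simply the special case k = 1.

# A sketch of the set-associative mapping rule Bj -> Si if j (mod v) = i.
# The cache geometry below is an assumption chosen only for illustration.

BLOCK_BYTES = 64          # block (line) size
SETS        = 256         # v = number of sets
WAYS        = 4           # k = associativity (k = 1 would be direct-mapped)

def decompose(addr):
    block_number = addr // BLOCK_BYTES        # j: which memory block this address lies in
    offset       = addr %  BLOCK_BYTES        # word/byte offset within the block
    set_index    = block_number % SETS        # i = j mod v : the set the block maps to
    tag          = block_number // SETS       # remaining high-order bits, stored as the tag
    return tag, set_index, offset

addr = 0x12345678
tag, s, off = decompose(addr)
print(f"block j={addr // BLOCK_BYTES}, set i={s}, tag={tag:#x}, offset={off}")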
Sector mapping
• When the contents of a block frame are replaced from a new sector, the remaining block frames in the same sector are marked invalid. Only the block frames from the most recently referenced sector are marked valid for reference.
Advantages:
• Flexible to implement various block replacement algorithms
• Economical to perform a fully associative search over a limited number of sector tags
• Sector partitioning offers more freedom in grouping cache lines at both ends of the mapping.
Cycle counts
• This refers to the number of basic machine cycles needed for cache access, update and coherence
control.
• Cache speed is affected by underlying static or dynamic RAM technology, the cache organization
and the cache hit ratios.
• The write through or write back policy also affect the cycle count.
• Cache size, block size, set number, and associativity affect count
• The cycle count is directly related to the hit ratio; the average cycle count per access decreases almost linearly as the above cache parameters increase and the hit ratio improves.
Hit ratio
• The hit ratio is number of hits divided by total number of CPU references to memory (hits plus
misses).
• The hit ratio is affected by cache size and block size.
• It increases with increasing cache size.
• Limited cache size, initial (cold-start) loading, and changes in locality prevent a 100% hit ratio.
Memory Interleaving
• The main memory is built with multiple modules.
• These memory modules are connected to a system bus or a switching network to which other
resources such as processors or I/O devices are also connected.
• Once presented with a memory address, each memory module returns one word per cycle.
• It is possible to present different addresses to different memory modules so that parallel access of
multiple words can be done simultaneously or in a pipelined fashion.
Consider a main memory formed with m = 2^a memory modules, each containing w = 2^b words of memory cells. The total memory capacity is m·w = 2^(a+b) words.
These memory words are assigned linear addresses. Different ways of assigning linear addresses result
in different memory organizations.
Besides random access, the main memory is often block-accessed at consecutive addresses.
Low-order interleaving
• Low-order interleaving spreads contiguous memory locations across the m modules horizontally
(Fig. 5.15a).
• This implies that the low-order a bits of the memory address are used to identify the memory
module.
• The high-order b bits are the word addresses (displacement) within each module.
• Note that the same word address is applied to all memory modules simultaneously. A module
address decoder is used to distribute module addresses.
High-order interleaving
• High-order interleaving uses the high-order a bits as the module address and the low-order b bits as
the word address within each module (Fig. 5.15b).
• Contiguous memory locations are thus assigned to the same memory module. In each memory
cycle, only one word is accessed from each module.
• Thus the high-order interleaving cannot support block access of contiguous locations.
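The difference between the two organizations can be seen from a small sketch that decomposes an address both ways for the eight-way case (m = 8 modules of w = 8 words, i.e., a = b = 3):

# A sketch contrasting low-order and high-order interleaving for a memory
# built from m = 2^a modules of w = 2^b words each.

a, b = 3, 3
m, w = 1 << a, 1 << b      # m modules, w words per module

def low_order(addr):
    # low-order a bits pick the module; the remaining bits are the word address
    return addr % m, addr // m          # (module, displacement)

def high_order(addr):
    # high-order a bits pick the module; low-order b bits are the word address
    return addr // w, addr % w          # (module, displacement)

for addr in range(8):
    print(addr, "low-order ->", low_order(addr), " high-order ->", high_order(addr))
# Contiguous addresses spread across modules under low-order interleaving,
# but stay inside one module under high-order interleaving.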
• An eight-way interleaved memory (with m=8 and w=8 and thus a=b=3) is shown in Fig. 5.16a.
• Let θ be the major cycle and τ the minor cycle. These two cycle times are related as follows:
τ = θ / m
where
m = degree of interleaving,
θ = total time required to complete the access of one word (major cycle),
τ = actual time needed to produce one word (minor cycle).
• The total block access time is 2θ, and the effective access time of each accessed word is reduced to τ once the pipeline of accesses is filled.
• The timing of the pipelined access of the 8 contiguous memory words is shown in Fig. 5.16b.
• This type of concurrent access of contiguous words has been called a C-access memory scheme.
The effective memory bandwidth of an m-way interleaved memory (in words per memory cycle) was estimated by Hellerman as
B ≈ m^0.56 ≈ √m    (5.5)
where m is the number of interleaved memory modules.
This equation implies that if 16 memory modules are used, then the effective memory bandwidth is approximately four times that of a single module.
This pessimistic estimate is due to the fact that block access of various lengths and access of single
words are randomly mixed in user programs.
Hellerman's estimate was based on a single-processor system. If memory-access conflicts from
multiple processors (such as the hot spot problem) are considered, the effective memory bandwidth
will be further reduced.
In a vector processing computer, the access time of a long vector with n elements and stride distance 1 has been estimated by Cragon (1992) as follows.
It is assumed that the n elements are stored in contiguous memory locations in an m-way interleaved memory system. The average time t1 required to access one element is then
t1 = (θ/m) · (1 + (m − 1)/n)    (5.6)
As n → ∞ (a very long vector), t1 approaches τ = θ/m; as n → 1 (scalar access), t1 approaches θ.
Fault Tolerance
High- and low-order interleaving can be combined to yield many different interleaved memory
organizations.
Sequential addresses are assigned in the high-order interleaved memory in each memory module.
This makes it easier to isolate faulty memory modules in a memory bank of m memory modules.
When one module failure is detected, the remaining modules can still be used by opening a window in the address space.
This fault isolation cannot be carried out in a low-order interleaved memory, in which a module
failure may paralyze the entire memory bank.
Thus low-order interleaving memory is not fault-tolerant.
• Local allocation: considers only the working set of the faulting process in making a swapping decision
• Global allocation: considers the history of the working sets of all resident processes in making a swapping decision
Swapping Systems
• Allow swapping only at entire process level
• Swap device: a configurable section of a disk set aside for temporary storage of swapped-out data
• Swap space: the portion of the disk set aside for this purpose
• Depending on system, may swap entire processes only, or the necessary pages
Swapping in UNIX
• System calls that result in a swap:
– Allocation of space for child process being created
– Increase in size of a process address space
– Increased space demand by stack for a process
– Demand for space by a returning process swapped out previously
• Special process 0 is the swapper
Demand Paging
• Allows individual pages (rather than entire processes) to be transferred between main memory and the swap device
• Pages are brought in only on demand
• Allows a process address space to be larger than the physical address space
• Offers the flexibility to dynamically accommodate a large number of processes in physical memory on a time-sharing basis
Working Sets
• The working set is the set of pages referenced by the process during the last n memory references (n = window size)
• Only the working sets of active processes are resident in memory
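A small sketch of the working-set definition, using a made-up reference string and window size:

# The working set at time t with window size n is the set of distinct pages
# touched by the last n references.  The reference string is made-up data.

def working_set(references, t, n):
    window = references[max(0, t - n + 1): t + 1]   # last n references up to time t
    return set(window)

refs = [2, 5, 2, 7, 7, 5, 1, 2, 2, 7]
print(working_set(refs, t=9, n=4))   # pages touched in the last 4 references -> {1, 2, 7}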
Other Policies
• Memory inconsistency: when memory access order differs from program execution order
• Sequential consistency: memory accesses (I and D) consistent with program execution order
Event Orderings
• Processes: concurrent instruction streams executing on different processors
• Consistency models specify the order by which events from one process should be observed by
another
• Event ordering helps determine if a memory event is legal for concurrent accesses
• Program order: the order in which memory accesses occur during the execution of a single process, without any reordering
The event ordering can be used to declare whether a memory event is legal or illegal when several processes are accessing a common set of memory locations.
A program order is the order by which memory accesses occur for the execution of a single process,
provided that no program reordering has taken place.
Three primitive memory operations for the purpose of specifying memory consistency models are
defined:
(1) A load by processor Pi is considered performed with respect to processor Pk at a point of time
when the issuing of a store to the same location by Pk cannot affect the value returned by the load.
(2) A store by Pi is considered performed with respect to Pk at a point of time when an issued load to the same address by Pk returns the value written by this store.
(3) A load is globally performed if it is performed with respect to all processors and if the store that is
the source of the returned value has been performed with respect to all processors.
As illustrated in Fig. 5.19a, a processor can execute instructions out of program order using a
compiler to resequence instructions in order to boost performance.
A uniprocessor system allows these out-of-sequence executions provided that hardware interlock
mechanisms exist to check data and control dependences between instructions.
When a processor in a multiprocessor system executes a concurrent program as illustrated in Fig.
5.19b, local dependence checking is necessary but may not be sufficient to preserve the intended
outcome of a concurrent execution.
(c) If accesses are not atomic with multiple copies of the same data coexisting as in a cache-based
system, then different processors can individually observe different interleavings during the same
execution. In this case, the total number of possible execution instantiations of a program becomes
even larger.
Atomicity
Three categories of multiprocessor memory behavior:
• Program order preserved and uniform observation sequence by all processors
• Out-of-program-order allowed and uniform observation sequence by all processors
• Out-of-program-order allowed and nonuniform sequences observed by different processors
Atomic memory accesses: memory updates are known to all processors at the same time
Non-atomic: having individual program orders that conform is not a sufficient condition for sequential
consistency
– Multiprocessor cannot be strongly ordered
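The classic two-processor example below (not taken from these notes) shows what atomic, strongly ordered memory permits: enumerating every interleaving that preserves the two program orders, the outcome (0, 0) never appears, whereas a machine that lets a load bypass an earlier store could produce it.

# A sketch of what sequential consistency permits.  Two processes:
#   P1: x = 1; a = y          P2: y = 1; b = x
# Under sequential consistency every execution is some interleaving that
# preserves each process's program order, so (a, b) = (0, 0) is impossible.
from itertools import permutations

P1 = [("store", "x"), ("load", "y", "a")]
P2 = [("store", "y"), ("load", "x", "b")]
ops = P1 + P2

def run(seq):
    mem, regs = {"x": 0, "y": 0}, {}
    for op in seq:
        if op[0] == "store":
            mem[op[1]] = 1
        else:
            regs[op[2]] = mem[op[1]]
    return regs["a"], regs["b"]

outcomes = set()
for perm in permutations(range(4)):
    # keep only interleavings that preserve each process's program order
    if perm.index(0) < perm.index(1) and perm.index(2) < perm.index(3):
        outcomes.add(run([ops[i] for i in perm]))

print(outcomes)   # {(1, 1), (0, 1), (1, 0)} -- never (0, 0)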
Implementation Considerations
• The shared memory can be modeled as a single port (a switch) that services one operation at a time
• The order in which the switch is thrown, i.e., the order in which accesses are granted, determines the global order of memory-access operations
• Strong ordering preserves the program order in all processors
• Sequential consistency model leads to poor memory performance due to the imposed strong
ordering of memory events
• Multiprocessor model may range from strong (sequential) consistency to various degrees of
weak consistency
• Two models considered
– DSB (Dubois, Scheurich and Briggs) model
– TSO (Total Store Order) model
DSB Model
Dubois, Scheurich and Briggs have derived a weak consistency model by relating memory request
ordering to synchronization points in the program. We call this the DSB model specified by the
following 3 conditions:
1. All previous synchronization accesses must be performed, before a load or a store access is allowed to perform with respect to any other processor.
2. All previous load and store accesses must be performed, before a synchronization access is allowed to perform with respect to any other processor.
3. Synchronization accesses must be sequentially consistent with respect to one another.
TSO Model
Sindhu, Frailong and Cekleov have specified the TSO weak consistency model with 6 behavioral
axioms.
1. Load returns latest store result
2. Memory order is a total binary relation over all pairs of store operations
3. If two stores appear in a particular program order, then they must also appear in the same
memory order
4. If a memory operation follows a load in program order, then it must also follow the load in memory order
5. A swap operation is atomic with respect to other stores – no other store can interleave between
load/store parts of swap
6. All stores and swaps must eventually terminate.
Asynchronous Model
As shown in the figure data flow between adjacent stages in an asynchronous pipeline is controlled
by a handshaking protocol.
When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an acknowledge signal to Si.
Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers, where pipelined wormhole routing is practiced. Asynchronous pipelines may have a variable throughput rate.
Different amounts of delay may be experienced in different stages.
Synchronous Model:
Synchronous pipelines are illustrated in Fig. 6.1a. Clocked latches are used to interface between stages.
The latches are made with master-slave flip-flops, which can isolate inputs from outputs.
Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously.
The pipeline stages are combinational logic circuits. It is desired to have approximately equal
delays in all stages.
These delays determine the clock period and thus the speed of the pipeline. Unless otherwise
specified, only synchronous pipelines are studied.
The utilization pattern of successive stages in a synchronous pipeline is specified by a reservation
table.
For a linear pipeline, the utilization follows the diagonal streamline pattern shown in Fig. 6.1c.
This table is essentially a space-time diagram depicting the precedence relationship in using the
pipeline stages.
Successive tasks or operations are initiated one per cycle to enter the pipeline. Once the pipeline is
filled up, one result emerges from the pipeline for each additional cycle.
This throughput is sustained only if the successive tasks are independent of each other.
At the rising edge of the clock pulse, the data is latched into the master flip-flops of each latch register. The clock pulse has a width equal to the latch delay d.
The clock period is determined by τ = max{τi} + d = τm + d, where τm is the maximum stage (combinational logic) delay.
In general, τm >> d by one to two orders of magnitude, so the maximum stage delay τm dominates the clock period.
The pipeline frequency is defined as the inverse of the clock period:
f = 1 / τ
If one result is expected to come out of the pipeline per cycle, f represents the maximum
throughput of the pipeline.
Depending on the initiation rate of successive tasks entering the pipeline, the actual throughput of
the pipeline may be lower than f.
This is because more than one clock cycle has elapsed between successive task initiations.
Clock Skewing:
Ideally, we expect the clock pulses to arrive at all stages (latches) at the same time.
However, due to a problem known as clock skewing the same clock pulse may arrive at different
stages with a time offset of s.
Let tmax be the time delay of the longest logic path within a stage
tmin is the shortest logic path within a stage.
To avoid a race in two successive stages, we must choose
τm >= tmax + s and d <= tmin - s
These constraints translate into the following bounds on the clock period when clock skew takes
effect:
d + tmax + s <= τ <= τm + tmin - s
In the ideal case s = 0, tmax = τm, and tmin = d. Thus, we have τ= τm + d
Speedup Factor
The speedup factor of a k-stage pipeline over an equivalent nonpipelined processor, when n independent tasks are executed, is defined as
Sk = (n · k) / (k + (n − 1))
which approaches k as n → ∞.
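A quick numeric sketch of the formula above, together with the pipeline efficiency (Sk/k) and the throughput that follow from it; the clock period value is an assumption used only to report throughput in absolute terms.

# Speedup Sk = n*k / (k + n - 1) of a k-stage pipeline over a nonpipelined
# processor, plus the derived efficiency and throughput.  tau is an assumed
# clock period.

def pipeline_metrics(k, n, tau=1e-9):
    speedup    = (n * k) / (k + n - 1)      # vs. an equivalent nonpipelined processor
    efficiency = speedup / k                # fraction of the ideal k-fold speedup
    throughput = n / ((k + n - 1) * tau)    # completed tasks per second
    return speedup, efficiency, throughput

for n in (1, 8, 64, 1024):
    s, e, w = pipeline_metrics(k=5, n=n)
    print(f"n={n:5d}  Sk={s:5.2f}  efficiency={e:4.2f}  throughput={w:.2e} tasks/s")
# As n grows, Sk approaches k and the throughput approaches f = 1/tau.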
However, function partitioning in a dynamic pipeline becomes quite involved because the pipeline
stages are interconnected with loops in addition to streamline connections.
A multifunction dynamic pipeline is shown in Fig 6.3a. This pipeline has three stages.
Besides the streamline connections from S1 to S2 and from S2 to S3, there is a feed forward
connection from S1 to S3 and two feedback connections from S3 to S2 and from S3 to S1.
These feed forward and feedback connections make the scheduling of successive events into the
pipeline a nontrivial task.
With these connections, the output of the pipeline is not necessarily from the last stage.
In fact, following different dataflow patterns, one can use the same pipeline to evaluate different
functions
Reservation Tables:
The reservation table for a static linear pipeline is trivial in the sense that data flow follows a linear
streamline.
The reservation table for a dynamic pipeline becomes more interesting because a nonlinear pattern
is followed.
Given a pipeline configuration, multiple reservation tables can be generated for the evaluation of
different functions.
Two reservation tables are given in Figs. 6.3b and 6.3c, corresponding to a function X and a function Y, respectively.
Each function evaluation is specified by one reservation table. A static pipeline is specified by a
single reservation table.
A dynamic pipeline may be specified by more than one reservation table. Each reservation table
displays the time-space flow of data through the pipeline for one function evaluation.
Different functions may follow different paths on the reservation table.
A number of pipeline configurations may be represented by the same reservation table.
There is a many-to-many mapping between various pipeline configurations and different
reservation tables.
The number of columns in a reservation table is called the evaluation time of a given function.
Latency Analysis
The number of time units (clock cycles) between two initiations of a pipeline is the latency
between them.
Latency values must be nonnegative integers. A latency of k means that two initiations are separated by k clock cycles.
Any attempt by two or more initiations to use the same pipeline stage at the same time will cause a
collision.
A collision implies resource conflicts between two initiations in the pipeline. Therefore, all
collisions must be avoided in scheduling a sequence of pipeline initiations.
Some latencies will cause collisions, and some will not.
Latencies that cause collisions are called forbidden latencies.
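The forbidden latencies can be read directly off a reservation table: any latency equal to the distance between two marks in the same row causes a collision. The sketch below computes them for a made-up three-stage table (not the tables of Fig. 6.3).

# A latency is forbidden whenever two check marks in the same row of the
# reservation table are that many cycles apart.  The table is a made-up
# 3-stage, 6-cycle example.

reservation_table = {          # stage -> clock cycles in which the stage is used
    "S1": [0, 5],
    "S2": [1, 3],
    "S3": [2, 4],
}

def forbidden_latencies(table):
    forbidden = set()
    for cycles in table.values():
        for i in range(len(cycles)):
            for j in range(i + 1, len(cycles)):
                forbidden.add(abs(cycles[j] - cycles[i]))   # distance between two marks
    return forbidden

print(sorted(forbidden_latencies(reservation_table)))   # -> [2, 5] for this table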
For example, internal data forwarding can replace a load operation (LD R2, M) with a move operation (MOVE R2, R1) when the value in memory location M has just been stored from register R1.
Hazard Avoidance
• The read and write of shared variables by different instructions in a pipeline may lead to different
results if these instructions are executed out of order.
• As shown in Fig. 6.15, three types of logic hazards are possible:
• Consider two instructions I and J. Instruction J is assumed to logically follow instruction I
according to program order.
• If the actual execution order of these two instructions violate the program order, incorrect results
may be read or written, thereby producing hazards.
• Hazards should be prevented before these instructions enter the pipeline, such as by holding
instruction J until the dependence on instruction I is resolved.
• We use the notation D(I) and R(I) for the domain and range of an instruction I.
– Domain contains the Input Set to be used by instruction I
– Range contains the Output Set of instruction I
Listed below are the conditions under which possible hazards can occur:
– RAW (read-after-write) hazard: R(I) ∩ D(J) ≠ ∅
– WAW (write-after-write) hazard: R(I) ∩ R(J) ≠ ∅
– WAR (write-after-read) hazard: D(I) ∩ R(J) ≠ ∅
A RAW hazard corresponds to a flow dependence, WAW to an output dependence, and WAR to an antidependence between instructions I and J.
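A small sketch of these checks using Python sets for the domains and ranges; the two example instructions are made up.

# Hazard detection from the domain (read set) and range (write set) of two
# instructions, with I logically preceding J in program order.

def hazards(I, J):
    D_I, R_I = I
    D_J, R_J = J
    return {
        "RAW": bool(R_I & D_J),   # J reads something I writes  (flow dependence)
        "WAW": bool(R_I & R_J),   # J writes something I writes (output dependence)
        "WAR": bool(D_I & R_J),   # J writes something I reads  (antidependence)
    }

I = ({"R1", "R2"}, {"R3"})        # e.g. ADD R3, R1, R2  -> (D(I), R(I))
J = ({"R3", "R4"}, {"R1"})        # e.g. SUB R1, R3, R4  -> (D(J), R(J))
print(hazards(I, J))              # {'RAW': True, 'WAW': False, 'WAR': True}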
MODULE-IV
Parallel processing demands the use of efficient system interconnects for fast communication
among multiple processors and shared memory, I/O and peripheral devices.
Hierarchical buses, crossbar switches and multistage networks are often used for this purpose.
A generalized multiprocessor system is depicted in Fig. 7.1. This architecture combines features
from the UMA, NUMA and COMA models.
Each processor Pi is attached to its own local memory and private cache.
These multiple processors are connected to shared memory through an Interprocessor-Memory Network (IPMN).
Processors share the access of I/O and peripheral devices through Processor-I/O Network (PION).
Both IPMN and PION are necessary in a shared-resource multiprocessor.
An optional Interprocessor Communication Network (IPCN) can permit processor communication
without using shared memory.
Network Characteristics
The networks are designed with many choices, such as timing, switching, and control strategy. In a dynamic network, for example, the interconnections among the multiprocessors are established under program control.
Timing
Synchronous – controlled by a global clock which synchronizes all network activity.
Asynchronous – use handshaking or interlock mechanisms for communication and
especially suitable for coordinating devices with different speed.
Switching Method
Circuit switching – a pair of communicating devices control the path for the entire duration
of data transfer
Packet switching – large data transfers broken into smaller pieces, each of which can
compete for use of the path
Network Control
Centralized – global controller receives and acts on requests
Distributed – requests handled by local devices independently
A bus system consists of a hierarchy of buses connecting various system and subsystem
components in a computer.
Each bus is formed with a number of signal, control, and power lines. Different buses are used to
perform different interconnection functions.
In general, the hierarchy of bus systems are packaged at different levels as depicted in Fig. 7.2,
including local buses on boards, backplane buses, and I/O buses.
Local Bus
Buses implemented on printed-circuit boards are called local buses.
On a processor board one often finds a local bus which provides a common communication path
among major components (chips) mounted on the board.
A memory board uses a memory bus to connect the memory with the interface logic.
An I/O board or network interface board uses a data bus. Each of these board buses consists of
signal and utility lines.
Backplane Bus
A backplane is a printed circuit on which many connectors are used to plug in functional boards. A system bus, consisting of shared signal paths and utility lines, is built on the backplane. This system bus provides a common communication path among all plug-in boards.
I/O Bus
Input/output devices are connected to a computer system through an I/O bus such as the SCSI (Small Computer Systems Interface) bus.
This bus is made of coaxial cables with taps connecting disks, printers, and other devices to a processor through an I/O controller.
Special interface logic is used to connect various board types to the backplane bus.
Each single cluster operates on a single-bus system. Snoopy bus coherence protocols can be used to
establish consistency among first level caches belonging to the same cluster.
Second level caches are used to extend consistency from each local cluster to the upper level.
The upper level caches form another level of shared memory between each cluster and the main
memory modules connected to the intercluster bus.
Most memory requests should be satisfied at the lower level caches.
Intercluster cache coherence is controlled among the second-level caches and the resulting effects
are passed to the lower level.
This is because even if two processors attempted to access the same memory module (or I/O device) at the
same time, only one of the requests is serviced at a time.
Multistage Networks
Multistage networks consist of multiple stages of switch boxes, and should be able to connect any input to any output.
A multistage network is called blocking if the simultaneous connections of some multiple input/output
pairs may result in conflicts in the use of switches or communication links.
A nonblocking multistage network can perform all possible connections between inputs and outputs by
rearranging its connections.
Crossbar Networks
Crossbar networks connect every input to every output through a crosspoint switch. A crossbar network is a
single stage, non-blocking permutation network.
In an n-processor, m-memory system, an n × m crossbar network is needed. Each crosspoint is a unary switch which can be open or closed, providing a point-to-point connection path between a processor and a memory module.
Crosspoint switches must be designed to handle the potential contention for each memory module. A crossbar switch avoids competition for bandwidth by using O(N^2) switches to connect N inputs to N outputs.
Although highly non-scalable, crossbar switches are a popular mechanism for connecting a small number
of workstations, typically 20 or fewer.
Each processor provides a request line, a read/write line, a set of address lines, and a set of data lines to a
crosspoint switch for a single column. The crosspoint switch eventually responds with an
acknowledgement when the access has been completed.
Multiport Memory
Since crossbar switches are expensive and not suitable for systems with many processors or memory
modules, multiport memory modules may be used instead.
A multiport memory module has multiple connection points for processors (or I/O devices), and the
memory controller in the module handles the arbitration and switching that might otherwise have been
accomplished by a crosspoint switch.
A two-function switch box can assume only two possible states, namely straight or exchange. However, a four-function switch box can be in any of four possible states (straight, exchange, upper broadcast, lower broadcast). A multistage network is capable of connecting any input terminal to any output terminal. Multistage networks are basically constructed from so-called shuffle-exchange switching elements, each of which is basically a 2 x 2 crossbar. Multiple layers of these elements are connected to form the network.
Data routing is controlled by inspecting the destination code in binary. When the ith high-order bit of
the destination code is a 0, a 2 x 2 switch at stage i connects the input to the upper output.
Otherwise, the input is directed to the lower output.
Two switch settings are shown in Figs. 7.8a and b with respect to the permutations Π1 = (0,7,6,4,2) (1,3)(5) and Π2 = (0,6,4,7,3) (1,5)(2), respectively.
The switch settings in Fig. 7.8a are for the implementation of Π1, which maps 0→7, 7→6, 6→4, 4→2, 2→0, 1→3, 3→1, 5→5.
Consider the routing of a message from input 001 to output 011. This involves the use of switches
A, B, and C. Since the most significant bit of the destination 011 is a "zero," switch A must be set
straight so that the input 001 is connected to the upper output (labeled 2).
The middle bit in 011 is a "one," thus input 4 to switch B is connected to the lower output with a
"crossover" connection.
The least significant bit in 011 is a "one," implying a flat connection in switch C.
Similarly, the switches A, E, and D are set for routing a message from input 101 to output 101.
There exists no conflict in all the switch settings needed to implement the permutation Π1 in Fig.
7.8a.
Now consider implementing the permutation Π2 in the 8-input Omega network (Fig. 7.8b).
Conflicts in switch settings do exist in three switches identified as F, G, and H. The conflicts occurring at F are caused by the desired routings 000→110 and 100→111.
Since both destination addresses have a leading bit 1, both inputs to switch F must be connected to
the lower output.
To resolve the conflicts, one request must be blocked.
Similarly, we see conflicts at switch G between 011→000 and 111→011, and at switch H between 101→001 and 011→000. At switches I and J, broadcast is used from one input to two outputs, which is allowed if the hardware is built to have four legitimate states as shown in Fig. 2.24a.
The above example indicates the fact that not all permutations can be implemented in one pass
through the Omega network.
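The routing rule and the conflict behaviour described above can be reproduced with a short sketch: the path of each message is traced by shuffling the current port address and substituting the next destination bit, and a conflict is flagged whenever two messages need the same switch output at the same stage. The permutation lists encode Π1 and Π2 as given above.

# Destination-tag routing in an 8-input Omega network: before each stage the
# address is perfect-shuffled (rotated left) and the switch output is selected
# by the corresponding destination bit (0 = upper, 1 = lower).

K = 3                     # log2(8) stages
N = 1 << K

def trace_path(src, dst):
    ports, pos = [], src
    for stage in range(K):
        pos = ((pos << 1) | (pos >> (K - 1))) & (N - 1)   # perfect shuffle
        bit = (dst >> (K - 1 - stage)) & 1                # destination tag bit
        pos = (pos & ~1) | bit                            # pick upper/lower output
        ports.append((stage, pos))
    return ports

def conflicts(permutation):
    used, clashes = {}, []
    for src, dst in enumerate(permutation):
        for port in trace_path(src, dst):
            if port in used and used[port] != dst:        # same output wanted twice
                clashes.append((used[port], dst, port))
            used.setdefault(port, dst)
    return clashes

pi1 = [7, 3, 0, 1, 2, 5, 4, 6]    # Pi1 above: 0->7, 1->3, 2->0, 3->1, 4->2, 5->5, 6->4, 7->6
pi2 = [6, 5, 2, 0, 7, 1, 4, 3]    # Pi2 above: 0->6, 1->5, 2->2, 3->0, 4->7, 5->1, 6->4, 7->3
print("Pi1 conflicts:", conflicts(pi1))                      # empty: one pass suffices
print("Pi2 conflicts:", len(conflicts(pi2)), "conflicting pairs")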
In total, sixteen 8x8 crossbar switches are used in Fig. 7.10a and 16 x 8+8 x 8 = 192 are used in
Fig. 7.10b. Larger Butterfly networks can be modularly constructed using more stages.
Note that no broadcast connections are allowed in a Butterfly network, making these networks a
restricted subclass of Omega networks.
Fetch&Add
This atomic memory operation is effective in implementing an N-way synchronization with a
complexity independent of N.
In a Fetch&Add(x, e) operation, x is an integer variable in shared memory and e is an integer increment.
When a single processor executes this operation, the semantics is
Fetch&Add(x, e)
{ temp ← x;
x ← temp + e;    (7.1)
return temp }
When N processes attempt to Fetch&Add(x, e) the same memory word simultaneously, the
memory is updated only once following a serialization principle.
The sum of the N increments, e1 + e2 + • • • + eN, is produced in any arbitrary serialization of the N
requests.
This sum is added to the memory word x, resulting in a new value x + e1 + e2 + • • • + eN
The values returned to the N requests are all unique, depending on the serialization order followed.
The net result is similar to a sequential execution of N Fetch&Adds but is performed in one
indivisible operation.
Two simultaneous requests are combined in a switch as illustrated in Fig. 7.11.
One of the following operations will be performed if processor P1 executes Ans1 ← Fetch&Add(x, e1) and P2 executes Ans2 ← Fetch&Add(x, e2) simultaneously on the shared variable x.
If the request from P1 is executed ahead of that from P2, the following values are returned:
Ans1 ← x
Ans2 ← x + e1    (7.2)
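A sketch of these semantics using a lock to stand in for the serialization performed by the combining network; the increments 1, 2 and 4 are made-up values. Whatever order the threads are serialized in, each gets a distinct old value and the final sum is the same.

# Fetch&Add semantics of Eq. (7.1)/(7.2): concurrent requests are serialized,
# each requester receives a distinct old value, and memory ends up holding
# x + e1 + ... + eN.
import threading

x = 0                       # the shared variable
lock = threading.Lock()     # models the serialization in the combining network

def fetch_and_add(e):
    global x
    with lock:              # the operation is indivisible
        temp = x
        x = temp + e
        return temp

results = []
def worker(e):
    results.append((e, fetch_and_add(e)))

threads = [threading.Thread(target=worker, args=(e,)) for e in (1, 2, 4)]
for t in threads: t.start()
for t in threads: t.join()

print("returned (increment, old value):", results)
print("final x =", x)       # always 7, whatever serialization order occurred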
In a memory hierarchy for a multiprocessor system, data inconsistency may occur between
adjacent levels or within the same level.
For example, the cache and main memory may contain inconsistent copies of the same data object.
Multiple caches may possess different copies of the same memory block because multiple
processors operate asynchronously and independently.
Caches in a multiprocessing environment introduce the cache coherence problem. When multiple
processors maintain locally cached copies of a unique shared-memory location, any local
modification of the location can result in a globally inconsistent view of memory.
Cache coherence schemes prevent this problem by maintaining a uniform state for each cached
block of data.
Cache inconsistencies caused by data sharing, process migration or I/O are explained below.
• Consider a multiprocessor with two processors, each using a private cache and both sharing the
main memory.
• Let X be a shared data element which has been referenced by both processors. Before update, the
three copies of X are consistent.
• If processor P writes new data X’ into the cache, the same copy will be written immediately into
the shared memory under a write through policy.
• In this case, inconsistency occurs between the two copies (X and X') in the two caches.
• On the other hand, inconsistency may also occur when a write back policy is used, as shown on the
right.
• The main memory will be eventually updated when the modified data in the cache are replaced or
invalidated.
In both cases, inconsistency appears between the two cache copies, labeled X and X'. Special precautions must be exercised to avoid such inconsistencies. A coherence protocol must be established before processes can safely migrate from one processor to another.
• On the other hand, scalable multiprocessor systems interconnect processors using short point-to-
point links in direct or multistage networks.
• Unlike the situation in buses, the bandwidth of these networks increases as more processors are
added to the system.
• However, such networks do not have a convenient snooping mechanism and do not provide an
efficient broadcast capability. In such systems, the cache coherence problem can be solved using
some variant of directory schemes.
Using a write-invalidate protocol, the processor P1 modifies (writes) its cache from X to X’, and all
other copies are invalidated via the bus (denoted I in Fig. 7.14b). Invalidated blocks are called
dirty, meaning they should not be used.
The write-update protocol (Fig. 7.14c) demands the new block content X’ be broadcast to all cache
copies via the bus.
The memory copy is also updated if write-through caches are used. When write-back caches are used, the memory copy is updated later, at block replacement time.
• A remote processor is denoted j, where j ≠ i. For each of the two cache states, six possible events may take place.
• Note that all cache copies of the same block use the same transition graph in making state changes.
• In a valid state (Fig. 7.15a), all processors can read (R(i), R(j)) safely. Local processor i can also
write(W(i)) safely in a valid state. The invalid state corresponds to the case of the block either
being invalidated or being replaced (Z(i) or Z(j)).
2. Wormhole Routing
Packets are subdivided into smaller flits. Flit buffers are used in the hardware routers attached to
nodes.
The transmission from the source node to the destination node is done through a sequence of
routers.
All the flits in the same packet are transmitted in order as inseparable companions in a pipelined
fashion.
Only the header flit knows where the packet is going.
All the data flits must follow the header flit.
Flits from different packets cannot be mixed up. Otherwise they may be towed to the wrong
destination.
Asynchronous Pipelining
The pipelining of successive flits in a packet is done asynchronously using a handshaking protocol
as shown in Fig. 7.28. Along the path, a 1-bit ready/request (R/A) line is used between adjacent
routers.
When the receiving router (D) is ready (Fig. 7.28a) to receive a flit (i.e., a flit buffer is available), it pulls the R/A line low. When the sending router (S) is ready (Fig. 7.28b), it raises the line high and transmits flit i through the channel.
While the flit is being received by D (Fig. 7.28c), the R/A line is kept high. After flit i is removed from D's buffer (i.e., transmitted to the next node) (Fig. 7.28d), the cycle repeats itself for the transmission of the next flit i+1 until the entire packet is transmitted.
Advantages:
Very efficient
Faster clock
Latency Analysis:
Wormhole routing has a latency that is almost independent of the distance between the source and the destination:
T_WH = L/W + (F × D)/W
where L is the packet length, W the channel bandwidth, F the flit length, and D the distance (number of channels traversed). Since F << L in practice, the distance-dependent term F·D/W contributes little to the total latency.
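Plugging assumed numbers into the formula shows why the latency is nearly distance-independent:

# T_WH = L/W + F*D/W with assumed parameter values; the flit length F is much
# smaller than the packet length L, so distance barely matters.

W = 1e9          # channel bandwidth: 1 Gbit/s   (assumed)
L = 8192         # packet length: 8192 bits      (assumed)
F = 64           # flit length: 64 bits          (assumed)

for D in (1, 4, 16, 64):                     # number of hops traversed
    t_wh = L / W + (F * D) / W
    print(f"D={D:3d}  T_WH = {t_wh * 1e6:.2f} us")
# The latency grows only from about 8.26 us to 12.29 us as D goes from 1 to 64.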
Virtual channels
A virtual channel is a logical link between two nodes. It is formed by a flit buffer in the source node, a physical channel between them, and a flit buffer in the receiver node.
Four flit buffers are used at the source node and receiver node respectively. One source buffer is
paired with one receiver buffer to form a virtual channel when the physical channel is allocated for
the pair.
Thus the physical channel is time shared by all the virtual channels. By adding the virtual channel
the channel dependence graph can be modified and one can break the deadlock cycle.
Here the cycle can be converted to spiral thus avoiding a deadlock. Virtual channel can be
implemented with either unidirectional channel or bidirectional channels.
However a special arbitration line is needed between adjacent nodes interconnected by
bidirectional channel. This line determines the direction of information flow.
The virtual channel may reduce the effective channel bandwidth available to each request.
There exists a tradeoff between network throughput and communication latency in determining the
degree of using virtual channels.
Deadlock Avoidance
By adding two virtual channels, V3 and V4 in Fig. 7.32c, one can break the deadlock cycle. A modified
channel-dependence graph is obtained by using the virtual channels V3 and V4, after the use of channel
C2, instead of reusing C3 and C4.
The cycle in Fig. 7.32b is being converted to a spiral, thus avoiding a deadlock. Channel multiplexing
can be done at the flit level or at the packet level if the packet length is sufficiently short.
Virtual channels can be implemented with either unidirectional channels or bidirectional channels.
Vector: A vector is a set of scalar data items, all of the same type, stored in memory. Usually, the
vector elements are ordered to have a fixed addressing increment between successive elements called
the stride.
Vector Processing: Vector processing occurs when arithmetic or logical operations are applied to vectors. It is distinguished from scalar processing, which operates on one datum or one pair of data items at a time.
For such operations, vector processing is faster and more efficient than scalar processing.
Vectorization: The conversion from scalar code to vector code is called vectorization.
1. Vector - Vector instructions: One or two vector operands are fetched from the respective vector registers, enter through a functional pipeline unit, and produce results in another vector register.
F1: Vi → Vj
F2: Vi × Vj → Vk
2. Vector - Scalar instructions: A scalar operand s is combined with a vector operand.
F3: s × Vi → Vj
Example: V2 = 6 + V1
3. Vector - Memory instructions: These correspond to store and load operations between the vector registers (V) and the memory (M).
F4: M → V (Vector load)
F5: V → M (Vector store)
Examples: X = V1 (vector store), V2 = Y (vector load)
4. Vector reduction instructions: These include maximum, minimum, sum, and mean value.
F6: Vi → s
F7: Vi × Vj → s
5. Gather and scatter instructions: Two vector registers are used to gather or to scatter vector elements randomly throughout the memory, corresponding to the following mappings:
F8: M × Vi → Vj (Gather)
F9: Vi × Vj → M (Scatter)
Gather is an operation that fetches from memory the nonzero elements of a sparse vector using an index vector.
Scatter does the opposite, storing into memory the elements of a vector at the sparse locations specified by an index vector.
6. Masking instructions: The mask vector is used to compress or to expand a vector to a shorter or longer index vector (bit-per-index correspondence).
F10: Vi × Vm → Vj (Vm is a binary vector)
The gather, scatter, and masking instructions are very useful in handling sparse vectors or sparse matrices, which are often encountered in practical vector processing applications.
Sparse matrices are those in which most of the entries are zeros.
Advanced vector processors implement these instructions directly in hardware.
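The semantics of gather, scatter and masking can be sketched with ordinary Python lists; the memory contents, index vector and mask below are made-up data.

# Gather, scatter and masking semantics illustrated with plain lists.

memory = [0.0, 3.5, 0.0, 0.0, 7.25, 0.0, 1.5, 0.0]
index  = [1, 4, 6]                       # positions of the nonzero elements

# Gather (F8): fetch the sparse elements into a dense vector using the index vector.
v_data = [memory[i] for i in index]      # -> [3.5, 7.25, 1.5]

# Scatter (F9): store the dense elements back to their sparse locations.
for i, val in zip(index, v_data):
    memory[i] = val * 2                  # e.g. after doubling the dense values

# Masking (F10): compress a vector under a binary mask vector Vm.
v  = [10, 20, 30, 40]
vm = [1, 0, 1, 1]
compressed = [x for x, m in zip(v, vm) if m]   # -> [10, 30, 40]

print(v_data, memory, compressed)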
To access a vector with a stride of 1, successive addresses are latched in the address buffer at the
rate of one per cycle.
Effectively it takes m minor cycles to fetch m words, which equals one (major) memory cycle as
stated in Fig. 5.16b.
If the stride is 2, the successive accesses must be separated by two minor cycles in order to avoid
access conflicts. This reduces the memory throughput by one-half.
If the stride is 3, there is no module conflict and the maximum throughput (m words) results.
In general, C-access will yield the maximum throughput of m words per memory cycle if the stride
is relatively prime to m, the number of interleaved memory modules.
At the end of each memory cycle (Fig. 8.3b), m = 2^a consecutive words are latched. If the stride is greater than 1, the throughput decreases, roughly in proportion to the stride.
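The stride effect can be checked by computing which modules (address mod m) a strided access stream touches; the module count m = 8 and the strides below are assumed values.

# Under C-access, an address hits module (address mod m); a stride that shares
# a factor with m revisits only a subset of the modules and loses throughput.
from math import gcd

def modules_touched(stride, m=8, accesses=16):
    return sorted({(i * stride) % m for i in range(accesses)})

for stride in (1, 2, 3, 4, 8):
    mods = modules_touched(stride)
    note = "full throughput" if gcd(stride, 8) == 1 else "module conflicts"
    print(f"stride {stride}: modules {mods}  -> {note}")
# Strides 1 and 3 (relatively prime to m = 8) spread accesses over all modules;
# strides 2, 4 and 8 reuse only 4, 2 and 1 modules respectively.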
Do 10 I=1,N
Load R1, X(I)
Load R2, Y(I)
Multiply R1, S
Add R2, R1
Store Y(I), R2
10 Continue
where X(I) and Y(I), I=1, 2,…. N, are two source vectors originally residing in the memory. After the
computation, the resulting vector is stored back to the memory. S is an immediate constant supplied to
the multiply instruction.
After vectorization, the above scalar SAXPY code is converted to a sequence of five vector
instructions:
M(x : x + N−1) → V1    Vector load
M(y : y + N−1) → V2    Vector load
s × V1 → V1            Vector multiply
V2 + V1 → V2           Vector add
V2 → M(y : y + N−1)    Vector store
Here x and y are the starting memory addresses of the X and Y vectors, respectively; V1 and V2 are two N-element vector registers in the vector processor.
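The same five-step vector sequence can be sketched with NumPy arrays standing in for the vector registers V1 and V2; N, s and the array contents are assumed values.

# Vectorized SAXPY, Y = s*X + Y, with arrays standing in for vector registers.
import numpy as np

N, s = 8, 3.0
X = np.arange(1.0, N + 1)        # M(x : x+N-1)
Y = np.ones(N)                   # M(y : y+N-1)

V1 = X.copy()                    # vector load
V2 = Y.copy()                    # vector load
V1 = s * V1                      # vector multiply
V2 = V2 + V1                     # vector add
Y = V2                           # vector store

print(Y)                         # [ 4.  7. 10. 13. 16. 19. 22. 25.]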
Typical CVF for one-dimensional arrays are load, store, multiply, divide, logical and
shifting operations.
The number of available vector registers and functional pipelines impose some restrictions on how
many CVFs can be executed simultaneously.
Chaining:
Chaining is an extension of the technique of internal data forwarding practiced in scalar processors: the result stream of one functional pipeline is fed directly into another pipeline as an operand.
Chaining is limited by the small number of functional pipelines available in a vector processor.
Strip-mining:
When a vector has a length greater than that of the vector registers, segmentation of the long vector
into fixed-length segments is necessary. One vector segment is processed at a time (in Cray
computers segment is 64 elements).
Recurrence:
The special case of vector loops in which the output of a functional pipeline may feed back into
one of its own source vector registers
SIMD Implementation Models (two models for constructing SIMD supercomputers)
SIMD models are differentiated on the basis of the memory distribution and the addressing scheme used.
Most SIMD computers use a single control unit and distributed memories, except for a few that use
associative memories.
1. Distributed memory model
PEs are interconnected by a data routing network which performs inter-PE data communications
such as shifting, permutation and other routing operations.
Data sets and results could be exchanged between the front-end and the processing array in one of
three ways as shown in the figure:
Broadcasting: Broadcasting was carried out through the broadcast bus to all data
processors at once.
Global combining: Global combining allowed the front-end to obtain the sum, largest
value, logical OR etc, of values one from each processor.
Scalar memory bus: Scalar bus allowed the front-end to read or to write one 32-bit value
at a time from or to the memories attached to the data processors.
Processing Array
The processing array contained from 4K to 64K bit-slice data processors(PEs), all of which were
controlled by a sequencer.
Processing Nodes
Each data processing node contained 32 bit-slice data processors, an optional floating point accelerator
and interfaces for inter processor communication.
Hypercube Routers
The router nodes on all processor chips were wired together to form a Boolean n-cube. A full configuration of CM-2 had 4096 router nodes on processor chips interconnected as a 12-dimensional hypercube.
Physical memory was distributed among the processing nodes in various clusters. The distributed
memory formed a global address space.
Cache coherence was maintained using an invalidating, distributed directory-based protocol. For
each memory block, the directory kept track of remote nodes caching it.
When a write occurred, point-to-point messages were sent to invalidate remote copies of the block.
Acknowledgement messages were used to inform the originating node when an invalidation was
completed.
Two levels of local cache were used per processing node. Loads and writes were separated with the
use of write buffers for implementing weaker memory consistency models.
The main memory was shared by all processing nodes in the same cluster. To facilitate prefetching
and the directory-based coherence protocol, directory memory and remote-access caches were used
for each cluster.
The remote-access cache was shared by all processors in the same cluster.
The system uses virtual addresses instead of physical addresses for memory references.
Each virtual address space can be as large as a single node can provide and is shared by all nodes in
the system.
The SVM address space is organized in pages which can be accessed by any node in the system. A
memory-mapping manager on each node views its local memory as a large cache of pages for its
associated processor.
Page Swapping
A memory reference causes a page fault when the page containing the memory location is not in a
processor’s local memory.
When a page fault occurs, the memory manager retrieves the missing page from the memory of
another processor.
If there is a page frame available on the receiving node, the page is moved in.
Otherwise, the SVM system uses page replacement policies to find an available page frame,
swapping its contents to the sending node.
A hardware MMU can set the access rights (nil, read-only, writable) so that a memory access violating memory coherence will cause a page fault.
The memory coherence problem is solved in IVY through distributed fault handlers and their
servers. To client programs, this mechanism is completely transparent.
The large virtual address space allows programs to be larger in code and data space than the
physical memory on a single node.
This SVM approach offers the ease of shared-variable programming in a message-passing
environment.
In addition, it improves software portability and enhances system scalability through modular
memory growth.
1. Prefetching Techniques
Prefetching uses knowledge about the expected misses in a program to move the corresponding data close to the processor before it is actually needed.
Prefetching can be classified based on whether it is
Binding
Non binding
or whether it is controlled by
hardware
software
Binding prefetching: the value of a later reference (e.g., a register load) is bound at the time the prefetch completes.
Non binding prefetching : brings data close to the processor, but the data remains visible to the
cache coherence protocol and is thus kept consistent until the processor actually reads the value.
Hardware Controlled Prefetching: includes schemes such as long cache lines and instruction
lookahead.
Software Controlled Prefetching: explicit prefetch instructions are issued. Allows the prefetching
to be done selectively and extends the possible interval between prefetch issue and actual reference.
2. Coherent Caches
While the cache coherence problem is easily solved for small bus-based multiprocessors through
the use of snoopy cache coherence protocols, the problem is much more complicated for large scale
multiprocessors that use general interconnection networks.
As a result, some large scale multiprocessors did not provide caches, others provided caches that
must be kept coherent by software, and still others provided full hardware support for coherent
caches.
Caching of shared read-write data provided substantial gains in performance. The largest benefit
came from a reduction of cycles wasted due to read misses. The cycles wasted due to write misses
were also reduced.
Hardware cache coherence is an effective technique for substantially increasing the performance
with no assistance from the compiler or programmer.
Architecture Environment
A multithreaded MPP system is modeled by a network of processor (P) and memory (M) nodes, as shown in Fig. 9.11a. The distributed memories form a global address space.
Four machine parameters are defined below to analyze the performance of this network:
1. The Latency (L): This is the communication latency on a remote memory access. The value of
L includes the network delays, cache-miss penalty and delays caused by contentions in split
transactions.
2. The number of Threads (N): This is the number of threads that can be interleaved in each
processor. A thread is represented by a context consisting of a program counter, a register set
and the required context status words.
3. The context-switching overhead (C): This refers to the cycles lost in performing context
switching in a processor. This time depends on the switch mechanism and the amount of
processor states devoted to maintaining active threads.
4. The interval between switches (R): This refers to the cycles between switches triggered by
remote reference. The inverse p=1/R is called the rate of requests for remote accesses. This
reflects a combination of program behavior and memory system design.
In order to increase efficiency, one approach is to reduce the rate of requests by using distributed
coherent caches. Another is to eliminate processor waiting through multithreading.
Multithreaded Computations
Fig 9.11b shows the structure of the multithreaded parallel computations model.
The computation starts with a sequential thread (1), followed by supervisory scheduling (2) in which the processors begin threads of computation (3), followed by intercomputer messages that update variables among the nodes when the computer has distributed memory (4), and finally by synchronization prior to beginning the next unit of parallel work (5).
The communication overhead period (4) inherent in distributed memory structures is usually
distributed throughout the computation and is possibly completely overlapped.
Message passing overhead in multicomputers can be reduced by specialized hardware operating in
parallel with computation.
Communication bandwidth limits granularity, since a certain amount of data has to be transferred
with other nodes in order to complete a computational grain. Message passing calls (4) and
synchronization (5) are nonproductive.
Fast mechanisms to reduce or to hide these delays are therefore needed. Multithreading is not
capable of speedup in the execution of single threads, while weak ordering or relaxed consistency
models are capable of doing this.
Problems of Asynchrony
Massively parallel processors operate asynchronously in a network environment. The asynchrony
triggers two fundamental latency problems:
1. Remote loads
2. Synchronizing loads
Clearly the cost of thread switching should be much smaller than that of the latency of the remote
load, or else the processor might as well wait for the remote load’s response.
As the internode latency increases, more threads are needed to hide it effectively. Another concern
is to make sure that messages carry continuations. Suppose, after issuing a remote load from thread
T1 (Fig 9.13a), we switch to thread T2, which also issues a remote load.
The responses may not return in the same order. This may be caused by requests traveling different
distances, through varying degrees of congestion, to destination nodes whose loads differ greatly,
etc.
One way to cope with the problem is to associate each remote load and response with an identifier
for the appropriate thread, so that it can be reenabled on the arrival of a response.
2. Distributed Caching
The concept of distributed caching is shown in Fig. 9.13b. Every memory location has an owner node; for example, N1 owns B and N2 owns A.
The directories are used to contain import-export lists and state whether the data is shared (for
reads, many caches may hold copies) or exclusive (for writes, one cache holds the current value).
The directories multiplex among a small number of contexts to cover the cache loading effects.
The Distributed Caching offers a solution for the remote-loads problem, but not for the
synchronizing-loads problem.
Multithreading offers a solution for remote loads and possibly for synchronizing loads.
The two approaches can be combined to solve both types of remote access problems.
The objective is to maximize the fraction of time that the processor is busy, so we use the efficiency of the processor as our performance index, given by

Efficiency = busy / (busy + switching + idle)

where busy, switching and idle represent the amount of time, measured over some large observation interval, that the processor spends in the corresponding state.
The basic idea behind a multithreaded machine is to interleave the execution of several contexts in order to dramatically reduce the value of idle, but without overly increasing the magnitude of switching.
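As a rough, hedged illustration, the following C sketch estimates this efficiency with a commonly used linear/saturation analysis (an assumed model, not stated above): with N threads, run length R, context-switch cost C and remote latency L, a round of the processor either is long enough to hide L (saturation) or is not.

#include <stdio.h>

/* Sketch: estimated efficiency of a multithreaded processor under the
 * assumed round-robin model described in the lead-in above.          */
double efficiency(int N, double R, double L, double C)
{
    double round = N * (R + C);            /* useful + switch work of N threads  */
    double latency_bound = R + C + L;      /* one thread's run, switch, and wait */
    double period = (round > latency_bound) ? round : latency_bound;
    return (N * R) / period;               /* fraction of cycles the CPU is busy */
}

int main(void)
{
    /* Hypothetical numbers: 16-cycle run length, 128-cycle remote
     * latency, 4-cycle context switch.                              */
    for (int N = 1; N <= 12; N++)
        printf("N=%2d  E=%.2f\n", N, efficiency(N, 16.0, 128.0, 4.0));
    return 0;
}

With these assumed numbers the efficiency grows roughly linearly with N until the round is long enough to hide the remote latency, after which adding threads gives no further benefit.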
Context-Switching Policies
Different multithreaded architectures are distinguished by the context-switching policies adopted.
Four switching policies are:
1. Switch on Cache miss – This policy corresponds to the case where a context is preempted
when it causes a cache miss.
In this case, R is taken to be the average interval between misses (in Cycles) and L the time
required to satisfy the miss.
Here, the processor switches contexts only when it is certain that the current one will be
delayed for a significant number of cycles.
2. Switch on every load - This policy allows switching on every load, independent of whether it
will cause a miss or not.
In this case, R represents the average interval between loads. A general multithreading model
assumes that a context is blocked for L cycles after every switch; but in the case of a switch-on-
load processor, this happens only if the load causes a cache miss.
3. Switch on every instruction – This policy allows switching on every instruction, independent
of whether it is a load or not. Successive instructions become independent, which will benefit
pipelined execution.
4. Switch on block of instructions – Blocks of instructions from different threads are interleaved.
This will improve the cache-hit ratio due to locality. It will also benefit single-context
performance.
MODULE-5
1. Shared-Variable Model
• In all programming systems, processors are active resources while memory and I/O devices are passive resources. A program is a collection of processes, and parallelism depends on how interprocess communication (IPC) is implemented. The process address space is shared.
• To ensure orderly IPC, the mutual exclusion property requires that a shared object be accessed by only one process at a time.
Critical Section
• A critical section (CS) is a code segment accessing shared variables, which must be executed by only one process at a time and which, once started, must be completed without interruption.
• It should satisfy the following requirements:
Mutual exclusion: At most one process executes in the CS at a time.
No deadlock in waiting: No circular wait by two or more processes.
No preemption: No interruption until completion.
Eventual entry: A process attempting to enter its CS will eventually succeed.
Protected Access
• The granularity of the CS affects performance.
• If the CS is too large, it may limit parallelism due to excessive waiting by processes.
• If the CS is too small, it may add unnecessary code complexity and software overhead.
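A minimal sketch of a critical section using POSIX threads (pthreads, the counter, the thread count and the iteration count are all assumptions made for illustration). Note that the lock protects only the increment, keeping the CS small as recommended above:

#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;                 /* shared variable        */
static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&cs_lock);           /* enter critical section */
        shared_counter++;                       /* keep the CS small      */
        pthread_mutex_unlock(&cs_lock);         /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);  /* 400000 with the lock   */
    return 0;
}

Without the lock the final value would be unpredictable, because the increments of the shared variable from different threads would interleave.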
4 operational Modes
• Multiprogramming
• Multiprocessing
• Multitasking
• Multithreading
2. Message Passing Model
Two processes D and E residing at different processor nodes may communicate with each other by passing messages through a direct network. The messages may be instructions, data, synchronization signals, or interrupt signals.
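As one possible concrete realization (MPI is an assumption here, not named in the notes), two processes standing in for D and E can exchange a message as follows:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                     /* process "D": send a message      */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {              /* process "E": receive the message */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("E received %d from D\n", value);
    }

    MPI_Finalize();
    return 0;
}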
4. Object-Oriented Model
• Objects are program entities which encapsulate data and operations in a single unit.
• Objects are manipulated concurrently in concurrent object-oriented programming (OOP).
Actor Model
• This is a framework for Concurrent OOP.
• Actors -> independent component
• Communicate via asynchronous message passing.
• 3 primitives -> create, send-to and become.
Parallelism in COOP
Three common patterns of parallelism:
1) Pipeline concurrency
2) Divide and conquer
3) Cooperative Problem Solving
5. Functional and logic Model
• Functional programming languages -> Lisp, SISAL and Strand 88.
• Logic programming languages -> Concurrent Prolog and Parlog.
Functional Programming Model
• Should not produce any side effects.
• No concept of storage, assignment and branching.
• Single-assignment and dataflow languages are functional in nature.
Logic Programming Models
• Used for knowledge processing from large databases.
• Supports an implicit search strategy.
• AND-parallel execution and OR-parallel reduction techniques are used.
• Used in artificial intelligence applications.
1. Optimization Features
• Conversion of sequential Program to Parallel Program.
• The purpose is to match s/w parallelism with hardware parallelism.
• Software in practice:
1) Automated parallelizer
The Express C automated parallelizer and the Alliant FX Fortran compiler.
2) Semiautomated parallelizer
Needs compiler directives or programmer interaction.
2. Availability Features
Enhance user friendliness, make the language portable for a large number of parallel computers, and expand the applicability of software libraries.
1) Scalability
The language should be scalable to the number of processors and independent of the hardware topology.
2) Compatibility
Compatible with sequential language.
3) Portability
The language should be portable to shared-memory multiprocessors, message-passing machines, or both.
3. Synchronization/Communication Features
• Shared Variable (locks) for IPC
• Remote Procedure Calls
• Data Flow languages
• Mailboxes, semaphores, monitors
4. Control of Parallelism
• Coarse, medium and fine grain
• Explicit vs implicit parallelism
• Global Parallelism
• Loop Parallelism
• Task Parallelism
• Divide and Conquer Parallelism
• The (ForAll, EndAll) and (ParDo, ParEnd) commands can be interpreted either as a Doall loop
or as a Doacross loop.
Doacross I=2, N
    Do J=2, N
        S1: A(I, J) = (A(I, J-1) + A(I, J+1)) / 2
    Enddo
Endacross
Another program construct is (Cobegin, Coend) pair. All computations specified within the block
could be executed in parallel.
Cobegin
P1
P2
…..
Pn
Coend
This causes processes P1, P2, ..., Pn to start simultaneously and to proceed concurrently until they have all ended. The (Parbegin, Parend) pair has an equivalent meaning.
During the execution of a process, we can use a Fork Q command to spawn a new process Q:
Process P          Process Q
  .....              ........
  Fork Q             ........
  .....              End
  Join Q
The Join Q command recombines the two processes into one process.
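In a threads library, the (Fork Q, Join Q) pair corresponds roughly to creating and then joining a thread. A minimal POSIX-threads sketch (the process names follow the text; everything else is illustrative):

#include <pthread.h>
#include <stdio.h>

static void *process_Q(void *arg)        /* body of the spawned process Q */
{
    printf("Q running\n");
    return NULL;                         /* End of Q                      */
}

int main(void)                           /* process P                     */
{
    pthread_t q;
    pthread_create(&q, NULL, process_Q, NULL);   /* Fork Q                */
    printf("P continues in parallel with Q\n");
    pthread_join(q, NULL);                       /* Join Q                */
    return 0;
}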
1) Flow analysis
• Reveals data flow patterns used to determine data and control dependences.
• Flow analysis is carried out at various execution levels:
1) Instruction level -> VLIW or superscalar processors.
2) Loop level -> SIMD and systolic computers.
3) Task level -> Multiprocessors/multicomputers.
2) Optimization
• Transformation of the user program to exploit hardware capability.
• Aims at better performance.
• The goals are to maximize the speed of code execution and to minimize code length.
• Both local and global optimizations are applied.
Machine dependent Transformation
(a) Computer Architecture is defined as the arrangement by which the various system building
blocks—processors, functional units, main memory, cache, data paths, and so on—are interconnected
and inter-operated to achieve desired system performance.
(b) Processors make up the most important part of a computer system. Therefore, in addition to (a),
processor design also constitutes a central and very important element of computer architecture.
Various functional elements of a processor must be designed, interconnected and inter-operated to
achieve desired processor performance.
System performance is the key benchmark in the study of computer architecture. A computer
system must solve the real world problem, or support the real world application, for which the user
is installing it.
A basic rule of system design is that there should be no performance bottlenecks in the system. Typically, a performance bottleneck arises when one part of the system cannot keep up with the throughput requirements placed on the system as a whole.
In a computer system, the key subsystems are processors, memories, I/O interfaces, and the data
paths connecting them. Within the processors, we have subsystems such as functional units,
registers, cache memories, and internal data buses.
Within the computer system as a whole—or within a single processor—designers do not wish to
create bottlenecks to system performance.
Let us assume that our superscalar processor is designed for k instruction issues in every processor
clock cycle.
Clearly then the fetch, decode and issue pipeline stages, as well as the other elements of the
processor, must all be designed to process k instructions in every clock cycle.
On multiple-issue pipelines, the issue stage is usually separated from the decode stage. One reason for thus adding a pipeline stage is that it allows the processor to be driven by a faster clock.
Decode stage must be seen as preparation for instruction issue which—by definition—can occur
only if the relevant functional unit in the processor is in a state in which it can accept one more
operation for execution.
As a result of the issue, the operation is handed over to the functional unit for execution.
The process of issuing instructions to functional units also involves instruction scheduling. For
example, if instruction Ij cannot be issued because the required functional unit is not free, then it
may still be possible to issue the next instruction Ij+1—provided that no dependence between the
two prohibits issuing instruction Ij+1.
When instruction scheduling is specified by the compiler in the machine code it generates, we refer
to it as static scheduling.
In theory, static scheduling should free up the processor hardware from the complexities of
instruction scheduling; in practice, though, things do not quite turn out that way.
If the processor control logic schedules instruction on the fly—taking into account inter-instruction
dependences as well as the state of the functional units—we refer to it as dynamic scheduling.
Much of the rest of this chapter is devoted to various aspects and techniques of dynamic
scheduling.
The basic aim in both types of scheduling—static as well as dynamic—is to maximize the
instruction level parallelism which is exploited in the executing sequence of instructions.
A branch prediction unit has also been shown in Fig. 12.4 and Fig. 12.5 to implement some form of
a branch prediction algorithm.
Figure 12.5 shows a processor design in which functional units are provided with reservation stations.
Such designs usually also make use of operand forwarding over a common data bus (CDB), with tags
to identify the source of data on the bus. Such a design also implies register renaming, which resolves
RAW and WAW dependences.
Loop Unrolling
One relatively simple technique which the compiler can employ is known as loop unrolling, by which independent instructions from multiple successive iterations of a loop can be made to execute in parallel. Unrolling means that the body of the loop is repeated n times for n successive values of the control variable, so that one iteration of the transformed loop performs the work of n iterations of the original loop.
Consider the following body of a loop in a user program, where all the variables except the loop
control variable i are assumed to be floating point:
for i = 0 to 58 do
    c[i] = a[i]*b[i] - p*d[i];
Now suppose that machine code is generated by the compiler as though the original program had
been written as:
for j = 0 to 52 step 4 do
{
    c[j] = a[j]*b[j] - p*d[j];
    c[j + 1] = a[j + 1]*b[j + 1] - p*d[j + 1];
    c[j + 2] = a[j + 2]*b[j + 2] - p*d[j + 2];
    c[j + 3] = a[j + 3]*b[j + 3] - p*d[j + 3];
}
c[56] = a[56]*b[56] - p*d[56];
c[57] = a[57]*b[57] - p*d[57];
c[58] = a[58]*b[58] - p*d[58];
Operand Forwarding
Operand forwarding helps in reducing the impact of true data dependences in the instruction stream. Consider the following simple sequence of two instructions in a running program:
ADD R1, R2, R3
SHIFTR #4, R3, R4
The result of the ADD instruction is stored in destination register R3, and then shifted right by four bits
in the second instruction, with the shifted value being placed in R4.
Thus, there is a simple RAW dependence between the two instructions: the output of the first is required as an input operand of the second, which can be represented in the form of a dependence graph.
In a pipelined processor, ideally the second instruction should be executed one stage—and
therefore one clock cycle—behind the first.
However, the difficulty here is that it takes one clock cycle to transfer ALU output to destination
register R3, and then another clock cycle to transfer the contents of register R3 to ALU input for
the right shift. Thus a total of two clock cycles are needed to bring the result of the first instruction
where it is needed for the second instruction.
Therefore, as things stand, the second instruction above cannot be executed just one clock cycle
behind the first.
This sequence of data transfers has been illustrated in Fig. 12.7 (a). In clock cycle Tk, ALU output
is transferred to R3 over an internal data path. In the next clock cycle Tk + 1, the content of R3 is
transferred to ALU input for the right shift.
When carried out in this order, clearly the two data transfer operations take two clock cycles.
But note that the required two transfers of data can be achieved in only one clock cycle if ALU
output is sent to both R3 and ALU input in the same clock cycle—as illustrated in Fig. 12.7 (b).
In general, if X is to be copied to Y, and in the next clock cycle Y is to be copied to Z, then we can
just as well copy X to both Y and Z in one clock cycle.
Since instructions execute in parallel on multiple functional units, the reorder buffer serves the
function of bringing completed instructions back into an order which is consistent with program
order.
Note that instructions may complete in an order which is not related to program order, but must be
committed in program order.
At any time, program state and processor state are defined in terms of instructions which have
been committed—i.e. their results are reflected in appropriate registers and/or memory locations.
Entries in the reorder buffer are completed instructions, which are queued in program order.
However, since instructions do not necessarily complete in program order, we also need a flag with
each reorder buffer entry to indicate whether the instruction in that position has completed.
Figure 12.9 shows a reorder buffer of size eight. Four fields are shown with each entry in the reorder
buffer—instruction identifier, value computed, program-specified destination of the value computed,
and a flag indicating whether the instruction has completed (i.e. the computed value is available).
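A minimal sketch of such an entry and of in-order commit from the head of the buffer (C structures with illustrative field names; only the eight-entry size follows Fig. 12.9):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ROB_SIZE 8                      /* eight-entry buffer, as in Fig. 12.9     */

typedef struct {
    uint32_t inst_id;                   /* instruction identifier                  */
    int64_t  value;                     /* value computed by the instruction       */
    uint8_t  dest_reg;                  /* program-specified destination register  */
    bool     completed;                 /* has the value been computed yet?        */
} RobEntry;

typedef struct {
    RobEntry entry[ROB_SIZE];
    int head, tail, count;              /* circular queue kept in program order    */
} ReorderBuffer;

/* Commit from the head only while entries have completed, so results
 * reach the register file strictly in program order.                 */
static void commit(ReorderBuffer *rob, int64_t regfile[])
{
    while (rob->count > 0 && rob->entry[rob->head].completed) {
        RobEntry *e = &rob->entry[rob->head];
        regfile[e->dest_reg] = e->value;            /* update program state */
        rob->head = (rob->head + 1) % ROB_SIZE;
        rob->count--;
    }
}

int main(void)
{
    int64_t regs[32] = { 0 };
    ReorderBuffer rob = { .head = 0, .tail = 0, .count = 2 };
    rob.entry[0] = (RobEntry){ 1, 99, 4, true  };   /* completed, heads the queue */
    rob.entry[1] = (RobEntry){ 2,  0, 7, false };   /* still executing            */
    commit(&rob, regs);                             /* commits only instruction 1 */
    printf("R4 = %ld, entries left = %d\n", (long)regs[4], rob.count);
    return 0;
}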
We now take a brief look at how the use of reorder buffer addresses the various types of dependences
in the program.
1. Data Dependences: A RAW dependence—i.e. true data dependence—will hold up the execution of
the dependent instruction if the result value required as its input operand is not available. As suggested
above, operand forwarding can be added to this scheme to speed up the supply of the needed input
operand as soon as its value has been computed.
WAR and WAW dependences—i.e. anti-dependence and output dependence, respectively—
also hold up the execution of the dependent instruction and create a possible pipeline stall. We
shall see below that the technique of register renaming is needed to avoid the adverse impact of
these two types of dependences.
2. Control Dependences:
Suppose the instruction(s) in the reorder buffer belong to a branch in the program which should not
have been taken—i.e. there has been a mis-predicted branch. Clearly then the reorder buffer should be
flushed along with other elements of the pipeline.
Therefore the performance impact of control dependences in the running program is determined by the
accuracy of branch prediction technique employed.
The reorder buffer plays no direct role in the handling of control dependences.
3. Resource Dependences: If an instruction needs a functional unit to execute, but the unit is not
free, then the instruction must wait for the unit to become free—clearly no technique in the world can
change that.
In such cases, the processor designer can aim to achieve at least this: if a subsequent instruction needs
to use another functional unit which is free, then the subsequent instruction can be executed out of
order.
Traditional compilers and assembly language programmers work with a fairly small number of
programmable registers.
Therefore the only way to make a larger number of registers available to instructions under
execution within the processor is to make the additional registers invisible to machine language
instructions.
Instructions under execution would use these additional registers, even if instructions making up
the machine language program stored in memory cannot refer to them.
For example, let us say that the instruction:
FADD R1, R2, R5
is followed by the instruction:
FSUB R3, R4, R5
Both these instructions are writing to register R5, creating thereby a WAW dependence—i.e. output
dependence—on register R5.
Let FSUB write its output value to a register other than R5, and let us call that other register X.
Then the instructions which use the value generated by FSUB will refer to X, while the instructions
which use the value generated by FADD will continue to refer to R5.
Now, since FADD and FSUB are writing to two different registers, the output dependence or
WAW between them has been removed.
When FSUB commits, then the value in R5 should be updated by the value in X—i.e. the value
computed by FSUB.
Then the physical register X, which is not a program visible register, can be freed up for use in
another such situation.
Note that here we have mapped—or renamed—R5 to X, for the purpose of storing the result of
FSUB, and thereby removed the WAW dependence from the instruction stream. A pipeline stall
will now not be created due to the WAW dependence.
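A sketch of the bookkeeping this requires, using a rename map from architectural to physical registers and a free list (register counts and all names here are illustrative assumptions):

#include <stdio.h>

#define ARCH_REGS 16
#define PHYS_REGS 32

static int rename_map[ARCH_REGS];         /* architectural -> physical register       */
static int free_list[PHYS_REGS], free_top;

/* Give the destination of a new instruction a fresh physical register,
 * removing WAW/WAR dependences on the architectural name.              */
static int rename_dest(int arch_reg)
{
    int phys = free_list[--free_top];     /* the register called "X" in the text      */
    rename_map[arch_reg] = phys;
    return phys;
}

int main(void)
{
    for (int i = 0; i < ARCH_REGS; i++) rename_map[i] = i;       /* identity at start */
    for (int i = ARCH_REGS; i < PHYS_REGS; i++) free_list[free_top++] = i;

    int fadd_dest = rename_map[5];        /* FADD R1,R2,R5 writes the current mapping */
    int fsub_dest = rename_dest(5);       /* FSUB R3,R4,R5 gets a new physical reg X  */
    printf("FADD writes P%d, FSUB writes P%d: WAW on R5 removed\n",
           fadd_dest, fsub_dest);
    return 0;
}

When FSUB commits, the architectural name R5 simply maps to the new physical register, and the old one can be returned to the free list.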
When the needed operand value or values are available in a reservation station, the functional unit
can initiate the required operation in the next clock cycle.
At the time of instruction issue, the reservation station is filled out with the operation code (op).
If an operand value is available, for example in a programmable register, it is transferred to the
corresponding source operand field in the reservation station.
However, if the operand value is not available at the time of issue, the corresponding source tag (t1
and/or t2) is copied into the reservation station.
The source tag identifies the source of the required operand. As soon as the required operand value
is available at its source—which would be typically the output of a functional unit—the data value
is forwarded over the common data bus, along with the source tag.
This value is copied into all the reservation station operand slots which have the matching tag.
Thus operand forwarding is achieved here with the use of tags. All the destinations which require a
data value receive it in the same clock cycle over the common data bus, by matching their stored
operand tags with the source tag sent out over the bus.
Assume that instruction I1 is to write its result into R4, and that two subsequent instructions I2 and
I3 are to read—i.e. make use of—that result value.
Thus instructions I2 and I3 are truly data dependent (RAW dependent) on instruction I1. See Fig.
12.13.
Assume that the value in R4 is not available when I2 and I3 are issued; the reason could be, for
example, that one of the operands needed for I1 is itself not available.
Thus we assume that I1 has not even started executing when I2 and I3 are issued.
When I2 and I3 are issued, they are parked in the reservation stations of the appropriate functional
units.
Since the required result value from I1 is not available, these reservation station entries of I2 and I3
get source tag corresponding to the output of I1—i.e. output of the functional unit which is
performing the operation of I1.
When the result of I1 becomes available at its functional unit, it is sent over the common data bus
along with the tag value of its source—i.e. output of functional unit.
At this point, programmable register R4 as well as the reservation stations assigned to I2 and I3
have the matching source tag—since they are waiting for the same result value, which is being
computed by I1.
When the tag sent over the common data bus matches the tag in any destination, the data value on
the bus is copied from the bus into the destination.
The copy occurs at the same time into all the destinations which require that data value. Thus R4 as
well as the two reservation stations holding I2 and I3 receive the required data value, which has
been computed by I1, at the same time over the common data bus.
Thus, through the use of source tags and the common data bus, in one clock cycle, three destination
registers receive the value produced by I1—programmable register R4, and the operand registers in
the reservation stations assigned to I2 and I3.
Let us assume that, at this point, the second operands of I2 and I3 are already available within their
corresponding reservation stations.
Then the operations corresponding to I2 and I3 can begin in parallel as soon as the result of I1
becomes available—since we have assumed here that I2 and I3 execute on two separate functional
units.
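A sketch of the tag-matching broadcast just described (the structures, the tag value and the operand values are illustrative; I1's result is modeled as a single value broadcast with its source tag on the common data bus):

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool ready;        /* operand value present?       */
    int  tag;          /* source tag if still waiting  */
    int  value;        /* operand value once available */
} Operand;

typedef struct {
    char    op;        /* operation code, e.g. '+'     */
    Operand src1, src2;
} ReservationStation;

/* Common data bus broadcast: every waiting slot whose tag matches the
 * source tag copies the value in the same clock cycle.                */
static void cdb_broadcast(ReservationStation rs[], int n, int src_tag, int value)
{
    for (int i = 0; i < n; i++) {
        if (!rs[i].src1.ready && rs[i].src1.tag == src_tag) {
            rs[i].src1.value = value;  rs[i].src1.ready = true;
        }
        if (!rs[i].src2.ready && rs[i].src2.tag == src_tag) {
            rs[i].src2.value = value;  rs[i].src2.ready = true;
        }
    }
}

int main(void)
{
    /* I2 and I3 wait for the result of I1, identified here by tag 7;
     * register R4 would match and receive the value in the same way.  */
    ReservationStation rs[2] = {
        { '-', { false, 7, 0 }, { true, 0, 10 } },   /* I2 */
        { '+', { false, 7, 0 }, { true, 0, 20 } },   /* I3 */
    };
    cdb_broadcast(rs, 2, 7, 42);                     /* I1 completes with value 42 */
    printf("I2 operands: %d %d\n", rs[0].src1.value, rs[0].src2.value);
    printf("I3 operands: %d %d\n", rs[1].src1.value, rs[1].src2.value);
    return 0;
}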
A basic branch prediction technique uses a so-called two-bit predictor. A two-bit counter is
maintained for every conditional branch instruction in the program. The two-bit counter has four
possible states; these four states and the possible transitions between these states are shown in Fig.
12.15.
When the counter state is 0 or 1, the respective branch is predicted as taken; when the counter state
is 2 or 3, the branch is predicted as not taken.
When the conditional branch instruction is executed and the actual branch outcome is known, the
state of the respective two-bit counter is changed as shown in the figure using solid and broken line
arrows.
When two successive predictions come out wrong, the prediction is changed from branch taken to
branch not taken, and vice versa.
In Fig. 12.15, state transitions made on mis-predictions are shown using broken line arrows, while solid line arrows show state transitions made on predictions which come out right.
This scheme uses a two-bit counter for every conditional branch, and there are many conditional
branches in the program.
Overall, therefore, this branch prediction logic needs a few kilobytes or more of fast memory.
One possible organization for this branch prediction memory is in the form of an array which is
indexed by low order bits of the instruction address.
If twelve low order bits are used to define the array index, for example, then the number of entries
in the array is 4096.
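A sketch of such a predictor in C, following the convention above that states 0 and 1 predict taken and states 2 and 3 predict not taken, with a 4096-entry table indexed by twelve low-order address bits; the saturating-counter update used here is the standard scheme and is assumed rather than taken from the figure:

#include <stdbool.h>
#include <stdint.h>

#define TABLE_BITS 12
#define TABLE_SIZE (1u << TABLE_BITS)           /* 4096 two-bit counters            */

static uint8_t counter[TABLE_SIZE];             /* states 0..3, 0 = strongly taken  */

static unsigned index_of(uint32_t pc) { return pc & (TABLE_SIZE - 1); }

/* States 0 and 1 predict taken; states 2 and 3 predict not taken. */
static bool predict(uint32_t pc)
{
    return counter[index_of(pc)] <= 1;
}

/* Saturating update: repeated outcomes in one direction push the counter
 * to that end, so a single misprediction from a "strong" state does not
 * flip the prediction.                                                   */
static void update(uint32_t pc, bool taken)
{
    uint8_t *c = &counter[index_of(pc)];
    if (taken)  { if (*c > 0) (*c)--; }         /* move toward strongly taken       */
    else        { if (*c < 3) (*c)++; }         /* move toward strongly not taken   */
}

int main(void)
{
    uint32_t pc = 0x400123;                     /* hypothetical branch address      */
    for (int i = 0; i < 5; i++) update(pc, true);
    return predict(pc) ? 0 : 1;                 /* predicts taken after taken runs  */
}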
To be effective, branch prediction should be carried out as early as possible in the instruction
pipeline.
As soon as a conditional branch instruction is decoded, branch prediction logic should predict
whether the branch is taken.
Accordingly, the next instruction address should be taken either as the branch target address (i.e.
branch is taken), or the sequentially next address in the program (i.e. branch is not taken).
One way to reduce the burden of dependences is to combine—with hardware support within the
processor—instructions from multiple independent threads of execution.
Such hardware support for multi-threading would provide the processor with a pool of instructions,
in various stages of execution, which have a relatively smaller number of dependences amongst
them, since the threads are independent of one another.
Let us consider once again the processor with instruction pipeline of depth eight, and with targeted
superscalar performance of four instructions completed in every clock cycle.
Now suppose that these instructions come from four independent threads of execution. Then, on
average, the number of instructions in the processor at any one time from one thread would be 4 X
8/4 = 8.
With the threads being independent of one another, there is a smaller total number of data
dependences amongst the instructions in the processor.
Further, with control dependences also being separated into four threads, less aggressive branch
prediction is needed.
Another major benefit of such hardware-supported multi-threading is that pipeline stalls are very
effectively utilized.
If one thread runs into a pipeline stall—for access to main memory, say—then another thread
makes use of the corresponding processor clock cycles, which would otherwise be wasted.
Thus hardware support for multi-threading becomes an important latency hiding technique.
To provide support for multi-threading, the processor must be designed to switch between
threads—either on the occurrence of a pipeline stall, or in a round robin manner.
As in the case of the operating system switching between running processes, in this case the
hardware context of a thread within the processor must be preserved.
Depending on the specific strategy adopted for switching between threads, hardware support for
multi-threading may be classified as one of the following: