CS8083 MCP Unit I Notes
UNIT I
Single core to Multi-core architectures – SIMD and MIMD systems –
Interconnection networks - Symmetric and Distributed Shared Memory
Architectures – Cache coherence - Performance Issues – Parallel program
design
1.1 INTRODUCTION
The processor is the main component of a computer system. It is the logic circuitry
that processes instructions; it is also called the CPU and is often described as the brain
of the computer system. The processor is responsible for all computational work, logical
decision making, and control of the other activities of the system.
The main work of the processor is to execute the low-level instructions loaded into
memory. Processors can be manufactured using different technologies:
▪ Single-core processors
▪ Multicore processors
Processors can also be divided into three types: multiprocessors, multithreaded
processors, and multicore processors.
Let us consider two arrays x and y, each with n elements, and we want to add the
elements of y to the elements of x:
for (i = 0; i < n; i++)
x[i] += y[i];
Assume that the SIMD system contains n ALUs: load x[i] and y[i] into the ith ALU,
have the ith ALU add y[i] to x[i], and store the result in x[i]. All of the ALUs then
remain busy until the computation completes.
Now let us assume the SIMD system contains m ALUs, with m < n. The addition can
then be performed in blocks of m elements at a time. In this case, while operating on the
last block of elements, some of the ALUs will be idle, as the sketch below shows.
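As a rough sketch (the function name is illustrative, not from the text), the m < n case
corresponds to strip-mining the loop into chunks of m elements; only the final partial
block leaves lanes idle:

/* Sketch: SIMD-style addition of n elements using m ALUs (m < n).
   Each iteration of the outer loop models one block in which all
   m ALUs operate in lockstep; on the last block only n % m lanes
   have work, so the remaining ALUs sit idle. */
void simd_add(double x[], const double y[], int n, int m) {
    for (int block = 0; block < n; block += m)
        for (int lane = 0; lane < m && block + lane < n; lane++)
            x[block + lane] += y[block + lane];
}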
Examples: parallel supercomputers, vector processors, graphics processing units
(GPUs), etc.
Vector processors
Through the late 1990s, the most widely used SIMD systems were vector processors.
Vector processors operate on arrays, or vectors, of data, while conventional CPUs
operate on individual data elements, or scalars.
Characteristics:
Vector registers- These are registers capable of storing a vector of operands and
operating simultaneously on their contents. The vector length is fixed by the system, and
can range from 4 to 128 64-bit elements.
Vector instructions- These are instructions that operate on vectors rather than scalars. If
the vector length is vector_length, these instructions have the great virtue that a simple
loop such as
for (i = 0; i < n; i++)
x[i] += y[i];
requires only a single load, a single add, and a single store for each block of
vector_length elements, while a conventional system requires a load, add, and store for
each element.
Interleaved memory-The memory system consists of multiple “banks” of memory,
which can be accessed more or less independently. After accessing one bank, there will
be a delay before it can be reaccessed, but a different bank can be accessed much sooner.
So if the elements of a vector are distributed across multiple banks, there can be little to
no delay in loading/storing successive elements.
Strided memory - In strided memory access, the program accesses elements of a vector
located at fixed intervals. For example, accessing the first element, the fifth element, the
ninth element, and so on, would be strided access with a stride of four.
Hardware Scatter/gather – Scatter is writing and gather is reading elements of a
vector located at irregular intervals—for example, accessing the first element, the second
element, the fourth element, the eighth element, and so on. Typical vector systems
provide special hardware to accelerate strided access and scatter/gather.
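Both access patterns can be written as plain C loops; the sketch below (array and
function names are illustrative) shows a strided read and a gather, which are exactly the
patterns that vector hardware accelerates:

/* Strided read with stride 4: sums x[0], x[4], x[8], ... */
double strided_sum(const double x[], int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i += 4)
        sum += x[i];
    return sum;
}

/* Gather: read the k elements of x whose positions are listed in
   idx[] into the contiguous array g[]. A scatter is the mirror
   image: indexed writes, x[idx[i]] = g[i]. */
void gather(double g[], const double x[], const int idx[], int k) {
    for (int i = 0; i < k; i++)
        g[i] = x[idx[i]];
}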
Graphics processing units
Real-time graphics application programming interfaces, or APIs, use points, lines, and
triangles to internally represent the surface of an object. They use a graphics processing
pipeline to convert the internal representation into an array of pixels that can be sent to a
computer screen. Several of the stages of this pipeline are programmable. The behavior
of the programmable stages is specified by functions called shader functions. The shader
functions are typically quite short—often just a few lines of C code.
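For flavor only (real shaders are written in languages such as GLSL or HLSL, not C), a
shader function has roughly this shape: a short pure function applied independently, and
hence in parallel, to every pixel or vertex:

/* Illustrative C sketch of a per-pixel shader-style function:
   scale a pixel's color by a brightness factor. The graphics
   pipeline applies such a function to every pixel independently. */
typedef struct { float r, g, b; } color;

color shade(color in, float brightness) {
    color out = { in.r * brightness,
                  in.g * brightness,
                  in.b * brightness };
    return out;
}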
In UMA (uniform memory access), the interconnection network connects all the
processors directly to main memory; hence the time to access any memory location is
the same for all the cores. (fig 1.4)
In NUMA (non-uniform memory access), the interconnection network connects each
processor directly to a different block of main memory. The time to access the block to
which a core is directly connected is less than the access time for the other blocks.
(fig 1.5)
Advantages:
Good flexibility, since multiple devices can be connected to a bus at little additional
cost
Disadvantages:
As the number of devices connected to the bus increases, contention for the bus also
increases, since the communication wires are shared. This in turn degrades the
performance of the bus.
If many processors are connected to a bus, then the processors will frequently have to
wait for access to main memory
2. Crossbars
As the number of connected devices grows, buses are replaced by switched
interconnects. Switches are used to control the routing of data among the connected
devices. The most commonly used switched interconnect is the crossbar.
Here the communication links are bidirectional. The circles represent the switches. The
configuration of the switch is shown in the figure
When we use crossbar switches, there will be a conflict between two cores if they
attempt to access the same memory module simultaneously. The figure below shows
simultaneous memory accesses using a crossbar switch.
• P1 writes to M4
• P2 reads from M3
• P3 reads from M1
• P4 writes to M2
Advantages:
Much faster than buses
Good performance for simultaneous accesses
Disadvantages:
The cost of the switches and links is high
1. Direct interconnect
In a direct interconnect, each switch is directly connected to a processor-memory pair,
and the switches are connected to each other.
The ideal direct interconnect is a fully connected network in which each switch is
directly connected to every other switch. (fig 1.8)
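As a quick worked count (not from the text): a fully connected network with p switches
needs p(p - 1)/2 switch-to-switch links, e.g., 4950 links for p = 100, which is why fully
connected networks are impractical beyond small systems.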
Ring
A ring can carry out multiple simultaneous communications. If there are n processors,
the number of links is 2n.
Fig 1.9 Typical structure of Ring and two dimensional toroidal mesh
Advantages:
Simple in nature
Less expensive
Disadvantages:
Some processors may be idle until other processors complete their tasks
Toroidal mesh
In a toroidal mesh, more complex switches are used. If there are n processors, the
number of links is 3n.
Advantage:
Good connectivity as compared with ring
Disadvantage:
Complex in nature
More expensive
Hypercube
Hypercubes are used in real-time systems and are built inductively. They are considered
highly connected direct interconnects.
A fully connected system containing two processors forms a one-dimensional
hypercube. By joining the corresponding switches of two one-dimensional hypercubes, a
two-dimensional hypercube is obtained; a three-dimensional hypercube is built from two
two-dimensional hypercubes, and so on.
A hypercube of dimension d has 2^d processors, and each of its switches is directly
connected to one processor and d other switches.
The figure 1.10 below shows the typical structure of one dimensional, two dimensional
and three dimensional hypercubes
Fig 1.10 Typical structure of one dimensional, two dimensional and three
dimensional hypercubes
Advantage:
Good connectivity as compared with ring and toroidal mesh
Disadvantage:
More powerful switches are needed.
More expensive
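The inductive construction gives a simple adjacency rule: two nodes are directly
connected exactly when their binary labels differ in a single bit. A minimal sketch
(function names are illustrative, not from the text):

#include <stdio.h>

/* Print the neighbors of a node in a d-dimensional hypercube.
   Flipping each of the d bits of the node's label enumerates
   its d neighbors. */
void print_neighbors(unsigned node, unsigned d) {
    for (unsigned bit = 0; bit < d; bit++)
        printf("node %u <-> node %u\n", node, node ^ (1u << bit));
}

int main(void) {
    /* Neighbors of node 5 (binary 101) in a 3-dimensional
       hypercube: 4 (100), 7 (111), and 1 (001). */
    print_neighbors(5, 3);
    return 0;
}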
2. Indirect Interconnect
There is no direct connection between the switches and the processors. A generic
indirect interconnect consists of a collection of processors, each with one incoming link
and one outgoing link, and a switching network; the links are unidirectional. The most
commonly used indirect interconnects are the distributed-memory crossbar and the
omega network.
The figure 1.11 below shows the general structure of indirect interconnect networking
Advantage:
Simultaneous communication is possible to a certain extent
Disadvantage:
Problems arise when two processors try to communicate with the same processor
Omega networks
Two-by-two crossbars are used to build an omega network. Unlike a crossbar, an omega
network cannot carry out all patterns of simultaneous communication. The figure below
shows the structure of an omega network. If p is the number of processors, the network
uses (p/2)·log2(p) of the 2x2 crossbar switches, for a total of 2p·log2(p) simple switches.
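As a quick check of the count (numbers chosen for illustration): with p = 8 processors an
omega network has log2(8) = 3 stages of 8/2 = 4 two-by-two crossbars each, i.e. 12
crossbars, or 2 × 8 × 3 = 48 simple switches; a full crossbar would need 8^2 = 64.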
Small-scale shared-memory machines usually support the caching of both shared and
private data. Private data is used by a single processor, while shared data is used by
multiple processors.
1.5.1 Cache Coherence in Multiprocessors
• The difficulty that arises when two different processors hold two different values for
the same memory location is known as the cache coherence problem.
We initially assume that neither cache contains the variable and that X has the value 1.
Suppose processors A and B both read X, so each cache now holds the value 1, and then
A writes a new value (say 0) to X. With a write-through cache, A's cache and the
memory both contain the new value, but B's cache does not; if B now reads X, it will
receive the stale value 1.
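The timeline, assuming write-through caches (the standard textbook sequence), is:
Time  Event              A's cache  B's cache  Memory X
0     -                  -          -          1
1     A reads X          1          -          1
2     B reads X          1          1          1
3     A writes 0 to X    0          1          0
After step 3, B's cached copy of X is stale.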
Informally, we could say that a memory system is coherent if any read of a data item
returns the most recently written value of that data item. This simple definition contains
two different aspects of memory system behavior. They are:
The first aspect, called coherence, defines what values can be returned by a read. The
second aspect, called consistency, determines when a written value will be returned by a
read.
Coherence defines the behavior of reads and writes to the same memory location, while
consistency defines the behavior of reads and writes with respect to accesses to other
memory locations.
Write-invalidate
• The processor that is writing data causes copies in the caches of all other
processors in the system to be rendered invalid before it changes its local copy.
The local processor does this by sending an invalidation signal over the bus, which
causes all of the other caches to check whether they hold a copy of the invalidated
block. Once the cached copies have been invalidated, the data on the local machine can
be updated until another processor requests it.
Write-update
• The processor that is writing the data broadcasts the new data over the bus
(without issuing the invalidation signal). All caches that contain copies of the data
are then updated. This scheme differs from write-invalidate in that, rather than leaving
only one valid local copy after a write, it keeps all cached copies up to date.
If two processors do attempt to write the same data simultaneously, one of them wins the
race, causing the other processor’s copy to be invalidated. For the other processor to
complete its write, it must obtain a new copy of the data, which must now contain the
updated value. Therefore, this protocol enforces write serialization.
The alternative to an invalidate protocol is to update all the cached copies of a data item
when that item is written. This type of protocol is called a write-update or
write-broadcast protocol.
Since write-back caches generate lower requirements for memory bandwidth, they are
greatly preferable in a multiprocessor, despite the slight increase in complexity.
Therefore, we focus on implementation with write-back caches.
Disadvantages:
• Compiler mechanisms for transparent software cache coherence are very limited.
• Without cache coherence, the multiprocessor loses the advantage of being able to
fetch and use multiple words (such as a cache block) in which the fetched data
remain coherent.
• Exclusive—Exactly one processor has a copy of the cache block and it has written
the block, so the memory copy is out of date. The processor is called the owner of
the block.
In addition to tracking the state of each cache block, we must track the processors that
have copies of the block when it is shared, since they will need to be invalidated on a
write.
• The home node is the node where the memory location and the directory entry of
an address reside.
• A remote node is the node that has a copy of a cache block, whether exclusive or
shared. A remote node may be the same as either the local node or the home node.
There is a catalog of message types that may be sent between the processors and the
directories; the figure below shows the types of messages sent among nodes.
A separate copy of the register file, a separate PC, and a separate page table are required
for each thread. There are two main approaches to multithreading.
• Fine-grained multithreading switches between threads on each instruction,
causing the execution of multiple threads to be interleaved. This interleaving is
often done in a round-robin fashion, skipping any threads that are stalled at that
time.
• Coarse-grained multithreading was invented as an alternative to fine-grained
multithreading. It switches threads only on costly stalls, such as level-two cache
misses.
• Thus, when core 0 updates the copy of x stored in its cache, if it also broadcasts
this information across the bus, and if core 1 is “snooping” the bus, it will see that
x has been updated and it can mark its copy of x as invalid. This is more or less
how snooping cache coherence works.
• The principal difference is that the broadcast only informs the other cores that the
cache line containing x has been updated, not that x has been updated.
• Snooping works with both write-through and write-back caches
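A minimal sketch of the snooping side of a write-invalidate protocol (illustrative only,
not any particular processor's implementation): every cache watches writes on the bus
and drops its own copy of the affected line, so the next local read must fetch the fresh
value.

/* Sketch of write-invalidate snooping. Each cache tracks the
   state of a line; when another core broadcasts a write to that
   line on the bus, the snooping cache invalidates its copy. */
typedef enum { INVALID, SHARED, MODIFIED } line_state;

typedef struct {
    line_state state;
    int        value;   /* cached copy of the line's data */
} cache_line;

/* Called on every other cache when a core broadcasts a write. */
void snoop_bus_write(cache_line *line) {
    if (line->state != INVALID)
        line->state = INVALID;   /* drop the stale copy */
}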
2. Amdahl's law
Amdahl's law states that the potential speedup gained by parallel execution of a program
is limited by the portion that cannot be parallelized. Let Fraction_parallelized be the
fraction of the task that can be parallelized and Fraction_unparallelized be the fraction
that cannot. Then TP can be calculated as
TP = Fraction_parallelized × TS / n + Fraction_unparallelized × TS
The speedup can be calculated as
Speedup = TS / TP = 1 / (Fraction_unparallelized + Fraction_parallelized / n)
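As a quick worked example (numbers chosen for illustration): if 90% of a program can
be parallelized (Fraction_parallelized = 0.9) and n = 8 cores are available, then
Speedup = 1 / (0.1 + 0.9/8) = 1 / 0.2125 ≈ 4.7. Even with unlimited cores, the speedup
can never exceed 1 / 0.1 = 10.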
3. Scalability
Scalability is the ability of a system or a network to handle a growing amount of work.
Suppose the efficiency of a parallel program is EP for a fixed number of cores and a
fixed input size. If, when the number of cores is increased, we can increase the problem
size at some rate so that the efficiency remains EP, then the program is scalable.
Let s be the size of the problem, with TS = s and TP = s/n + 1. Then the efficiency is
EP = TS / (n × TP) = s / (n(s/n + 1)) = s / (s + n)
Now let the number of processes/threads be mn and the problem size be s'. Keeping the
efficiency unchanged requires
s' / (s' + mn) = s / (s + n)
and solving this equation gives s' = ms.
From this it is clear that if the problem size is increased by the same factor as the
number of processes/threads, the same efficiency is achieved, and hence the program is
scalable.
In technical terms, the above scenario describes a weakly scalable program. If the same
efficiency can be maintained as the number of processes/threads grows without
increasing the problem size, then the program is strongly scalable.
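As an illustration (numbers chosen arbitrarily): with s = 1000 and n = 10, EP =
1000/1010 ≈ 0.99. Doubling the threads to 20 while doubling the problem size to s' =
2000 gives 2000/2020 ≈ 0.99, the same efficiency, so the program is weakly scalable.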
4. Time Factor
Depending on the API, many approaches can be used to estimate TS and TP. To make
the estimation easier and simpler, let us make the following assumptions:
• The time that elapses between the start of the program and the end of the program
is not considered.
• CPU time is not considered.
• The wall clock time will be considered, so there is a chance of variation in the
timings; hence, instead of taking the mean or median time, the minimum time
may be reported.
• In general, a processor will not run more than one thread.
• The time spent for IO operations will not be considered.
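A minimal sketch of these assumptions in code, assuming a POSIX system (MPI's
MPI_Wtime or OpenMP's omp_get_wtime would play the same role in MPI/OpenMP
programs): time the region of interest with a wall-clock timer and report the minimum
over several runs.

#include <stdio.h>
#include <sys/time.h>

/* Wall-clock (elapsed) time in seconds; gettimeofday measures
   wall time, not CPU time. */
double wall_time(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1.0e6;
}

int main(void) {
    double best = 1.0e30;
    for (int run = 0; run < 5; run++) {
        double start = wall_time();
        /* ... the code of interest (after input, before output) ... */
        double elapsed = wall_time() - start;
        if (elapsed < best)
            best = elapsed;   /* keep the minimum, not the mean */
    }
    printf("minimum elapsed time = %e seconds\n", best);
    return 0;
}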