
CS8083 MCP UNIT I

UNIT I
Single core to Multi-core architectures – SIMD and MIMD systems –
Interconnection networks - Symmetric and Distributed Shared Memory
Architectures – Cache coherence - Performance Issues – Parallel program
design
1.1 INTRODUCTION
The processor is the main component of a computer system. It is the logic circuitry that processes instructions; it is also called the CPU and is often described as the brain of the computer system. The processor is mainly responsible for performing all computational calculations, making logical decisions and controlling the different activities of the system.
The main work of the processor is to execute the low-level instructions loaded into memory. Processors are manufactured using different technologies:
▪ Single-core processors
▪ Multicore processors
Processors that exploit parallelism can be divided into three types: multiprocessors, multithreaded processors and multicore processors.

1.1.1 SINGLE CORE PROCESSOR


A single-core processor has only one processing core on the die to execute instructions. A single core is a single calculation or processing unit; a dual-core CPU has two such calculation or processing units. The performance difference between a dual-core and a single-core processor depends on the software being run and on how many programs are running on the computer at the same time.
Problems of Single-Core Processors
As we try to increase the clock speed of the processor, the amount of heat produced by the chip also increases. This is a major obstacle that prevents single-core processors from evolving much further.


Fig 1.1 Single core Processor Architecture


1.1.2 MULTI CORE PROCESSOR
Multicore processors became widely available in the market after 2005. They use two or more cores, embedded on the same die, to process instructions at the same time; many cores additionally support hyper-threading. A multicore processor may look like a single processor, but it contains two (dual-core), three (tri-core), four (quad-core), six (hexa-core), eight (octa-core) or ten (deca-core) cores. Some processors even have 22 or 32 cores.

Fig 1.2 Multicore Processor Architecture


Problems with multicore processors


According to Amdahl's law, the performance of a parallel program is limited by its serial portion, so simply increasing the number of cores may not be the best solution; there is still a need to increase the clock speed of the individual cores.

Single core vs multicore


Parameter | Single-Core Processor | Multi-Core Processor
Number of cores on a die | Single | Multiple
Instruction execution | Executes a single instruction at a time | Executes multiple instructions at a time using the multiple cores
Gain | Speeds up every program or software being executed | Speeds up the programs that are designed to use multiple cores
Performance | Depends on the clock frequency of the core | Depends on the clock frequency, the number of cores and the program being executed
Examples | Processors launched before 2005, e.g. 80386, 486, AMD 29000, AMD K6, Pentium I/II/III | Processors launched after 2005, e.g. Core 2 Duo, Athlon 64 X2, Core i3, i5 and i7

1.2 SIMD SYSTEMS


In parallel computing, Flynn’s taxonomy is frequently used to classify computer
architectures. It classifies a system according to the number of instruction streams and the
number of data streams it can simultaneously manage.
Single instruction, multiple data (SIMD) systems are parallel systems that operate on multiple data streams by applying the same instruction to multiple data items. An abstract SIMD system can therefore be thought of as having a single control unit and multiple ALUs.
An instruction is broadcast from the control unit to the ALUs, and each ALU either
applies the instruction to the current data item, or it is idle.


Let us consider two arrays x and y, each with n elements, and we want to add the
elements of y to the elements of x:
for (i = 0; i < n; i++)
x[i] += y[i];

Assume that the SIMD system contains 'n' ALUs: we can load x[i] and y[i] into the ith ALU, have the ith ALU add y[i] to x[i], and store the result in x[i]. All the ALUs then remain busy until the computation is complete.

Now let us assume another situation, where the SIMD system contains 'm' ALUs and m < n: the addition can then be performed on blocks of 'm' elements at a time, as sketched below. In this case, while operating on the last block of elements, some of the ALUs may be idle.
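A minimal C sketch of this blocked scheme (the block-scheduling loop is an illustrative assumption; only x, y, n and m come from the text):

/* Process the arrays in blocks of m elements, mimicking m SIMD ALUs. */
for (int start = 0; start < n; start += m) {
    int active = (n - start < m) ? (n - start) : m; /* ALUs busy in this round  */
    for (int j = 0; j < active; j++)                /* conceptually in lockstep */
        x[start + j] += y[start + j];
}

In the last round, m - active ALUs would sit idle, exactly as described above.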
Examples: parallel supercomputers, vector processors, graphics processing units (GPUs), etc.
Vector processors
Until the late 1990s, the most widely used SIMD systems were vector processors. Vector processors operate on arrays or vectors of data, while conventional CPUs operate on individual data elements, or scalars.
Characteristics:
Vector registers- These are registers capable of storing a vector of operands and
operating simultaneously on their contents. The vector length is fixed by the system, and
can range from 4 to 128 64-bit elements.
Vector instructions- These are instructions that operate on vectors rather than scalars. If the vector length is vector_length, these instructions have the great virtue that a simple loop such as
for (i = 0; i < n; i++)
x[i] += y[i];


requires only a single load, add, and store for each block of vector_length elements, while a conventional system requires a load, add, and store for each element.
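As a hedged illustration, the same loop can be written with x86 SSE intrinsics, which perform one load, add and store per block of four floats; the SSE instruction set, the float element type and the multiple-of-4 length are assumptions, since the notes do not name a particular vector ISA:

#include <xmmintrin.h>  /* SSE intrinsics (x86-specific assumption) */

/* Adds y to x four elements at a time; assumes n is a multiple of 4. */
void vec_add(float *x, const float *y, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);            /* one vector load     */
        __m128 vy = _mm_loadu_ps(&y[i]);            /* one vector load     */
        _mm_storeu_ps(&x[i], _mm_add_ps(vx, vy));   /* one add + one store */
    }
}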
Interleaved memory-The memory system consists of multiple “banks” of memory,
which can be accessed more or less independently. After accessing one bank, there will
be a delay before it can be reaccessed, but a different bank can be accessed much sooner.
So if the elements of a vector are distributed across multiple banks, there can be little to
no delay in loading/storing successive elements.
Strided memory - In strided memory access, the program accesses elements of a vector
located at fixed intervals. For example, accessing the first element, the fifth element, the
ninth element, and so on, would be strided access with a stride of four.
Hardware Scatter/gather – Scatter is writing and gather is reading elements of a
vector located at irregular intervals—for example, accessing the first element, the second
element, the fourth element, the eighth element, and so on. Typical vector systems
provide special hardware to accelerate strided access and scatter/gather.
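The two access patterns can be sketched in C as follows (the array a, the index array idx, the counter sum and the bound m are hypothetical names used only for illustration):

/* Strided access with a stride of four: elements 0, 4, 8, ... */
for (int i = 0; i < n; i += 4)
    x[i] += y[i];

/* Gather: read elements at the irregular positions stored in idx[]. */
for (int k = 0; k < m; k++)
    sum += a[idx[k]];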
Graphics processing units
Real-time graphics application programming interfaces, or APIs, use points, lines, and
triangles to internally represent the surface of an object. They use a graphics processing
pipeline to convert the internal representation into an array of pixels that can be sent to a
computer screen. Several of the stages of this pipeline are programmable. The behavior
of the programmable stages is specified by functions called shader functions. The shader
functions are typically quite short—often just a few lines of C code.


1.3 MIMD SYSTEMS


• Multiple instruction, multiple data, or MIMD, systems support multiple
simultaneous instruction streams operating on multiple data streams.
• They consist of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU.
• MIMD systems are usually asynchronous; that is, the processors can operate at their own pace.
• There are two principal types of MIMD systems: shared-memory systems and
distributed-memory systems.
1.3.1 Shared-memory systems
• A collection of autonomous processors is connected to a memory system via an
interconnection network, and each processor can access each memory location.
• Communicate implicitly by accessing shared data structures.
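A minimal sketch of such implicit communication, assuming OpenMP (the API is an assumption; the notes only say that shared data structures are accessed):

#include <omp.h>

/* Every thread reads and writes part of the same shared array x. */
void scale(double *x, int n, double a) {
    #pragma omp parallel for   /* threads divide the iterations among themselves */
    for (int i = 0; i < n; i++)
        x[i] *= a;
}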

Fig 1.3 Shared Memory Systems


• Two types:
• UMA – Uniform Memory access
• NUMA – Non Uniform Memory Access
UMA Systems are usually easier to program, since the programmer doesn’t need to
worry about different access times for different memory locations. Here the


interconnection network connects all the processors directly to main memory. Hence
the time to access all the memory locations will be the same for all the cores. (fig
1.4)
In NUMA, the interconnection network can directly connect each processor to
different blocks of main memory. Here the time to access the memory location to
which a core is directly connected is less when compared with the access time of
other blocks. (fig 1.5)

Fig 1.4 A UMA Multicore system

Fig 1.5 A NUMA Multicore System


1.3.2 Distributed-memory systems


• Each processor is paired with its own private memory, and the processor memory
pairs communicate over an interconnection network
• The processors usually communicate explicitly by sending messages or by using
special functions
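A minimal sketch of such explicit message passing, assuming MPI (the library is an assumption; the notes only speak of sending messages): process 0 sends an integer to process 1.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);       /* explicit send    */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                               /* explicit receive */
        printf("process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}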

Fig 1.6 A Distributed Memory System

1.4 INTERCONNECTION NETWORKS


The communication between the processors and the main memory is achieved through interconnection networks. They play a major role in the performance of both shared-memory and distributed-memory systems.
1.4.1 Shared Memory Interconnects
The most widely used interconnects are buses and crossbars
1. Buses:
A bus is a collection of parallel communication wires together with some hardware that
controls access to the bus. The key characteristic of a bus is that the communication wires
are shared by the devices that are connected to it.
Advantages:
Buses are implemented at low cost


Good flexibility, since multiple devices can be connected to a bus at little additional cost
Disadvantages:
As the number of devices connected to the bus increases, contention for use of the bus also increases, because the communication wires are shared. This in turn decreases the expected performance of the bus.
If a large number of processors are connected to the bus, the processors will frequently have to wait for access to main memory.

2. Crossbars
If a larger number of devices are connected, buses are replaced by switched interconnects. Switches are used to control the routing of data among the connected devices; the most commonly used switched interconnect is the crossbar.
Here the communication links are bidirectional. The circles represent the switches, and the configuration of a switch is shown in the figure.
When crossbar switches are used, there is a conflict between two cores only if they attempt to access the same memory module simultaneously. The figure below shows simultaneous memory accesses using a crossbar switch:
• P1 writes to M4
• P2 reads from M3
• P3 reads from M1
• P4 writes to M2
Advantages:
Much faster than buses
Good performance for simultaneous accesses
Disadvantages:
The cost of the switches and links is high


Fig 1.7 Crossbar switch

1.4.2 Distributed Memory Interconnects


Distributed-memory interconnects are often divided into two groups: direct
interconnects and indirect interconnects.


1. Direct interconnect
In a direct interconnect, each switch is directly connected to a processor-memory pair, and the switches are connected to each other.
The ideal direct interconnect is a fully connected network in which each switch is
directly connected to every other switch. (fig 1.8)

Fig 1.8 A fully connected network


The most commonly used direct interconnects are the ring, the two-dimensional toroidal mesh and the hypercube.

Ring
A ring can carry out multiple simultaneous communications. If there are 'n' processors, the number of links is 2n.

Fig 1.9 Typical structure of Ring and two dimensional toroidal mesh


Advantages:
Simple in nature
Less expensive
Disadvantages:
Some processors may have to remain idle until other processors complete their tasks

Toroidal mesh
In a toroidal mesh, the switches are more complex than those used in a ring. If there are 'n' processors, the number of links is 3n.
Advantage:
Good connectivity as compared with ring
Disadvantage:
Complex in nature
More expensive

Hypercube
Hypercubes have been used in actual systems and are built inductively. They are considered highly connected direct interconnects.
A one-dimensional hypercube is a fully connected system with two processors. By joining the corresponding switches of two one-dimensional hypercubes, a two-dimensional hypercube is obtained; a three-dimensional hypercube is built from two two-dimensional hypercubes, and so on.
A hypercube of dimension d has 2^d processors, and each switch is directly connected to a processor and to d other switches.
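For example, a three-dimensional hypercube (d = 3) connects 2^3 = 8 processors, and each of its switches is connected to one processor and three other switches.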
The figure 1.10 below shows the typical structure of one dimensional, two dimensional
and three dimensional hypercubes


Fig 1.10 Typical structure of one dimensional, two dimensional and three
dimensional hypercubes

Advantage:
Good connectivity as compared with ring and toroidal mesh
Disadvantage:
More powerful switches are needed.
More expensive

2. Indirect Interconnect
There is no direct connection between a switch and a processor. The generic structure of an indirect interconnect consists of a collection of processors connected by unidirectional links: each processor has an outgoing link and an incoming link, and the links are connected through a switching network. The most commonly used indirect interconnects are the distributed-memory crossbar and the omega network.
The figure 1.11 below shows the general structure of indirect interconnect networking


Fig 1.11 general structure of indirect interconnect networking

Distributed memory crossbar


A distributed-memory crossbar is similar to a shared-memory crossbar, but the distributed-memory crossbar uses unidirectional links whereas the shared-memory crossbar uses bidirectional links. The figure below shows the structure of a distributed-memory crossbar.

Fig 1.12 Structure of a distributed-memory crossbar


Simultaneous communication between processors is possible as long as no two processors attempt to communicate with the same processor. If 'p' is the number of processors, the distributed-memory crossbar needs p^2 switches.


Advantage:
Simultaneous communication is possible to a certain extent
Disadvantage:
A problem arises when two processors try to communicate with the same processor

Omega networks
An omega network is built from two-by-two crossbar switches. Unlike the crossbar, not all simultaneous communications are possible. The figure below shows the structure of an omega network. If 'p' is the number of processors and the network uses 2x2 crossbar switches, the omega network needs 2p log2(p) switches.
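As an illustrative count based on this formula, a system with p = 8 processors would need 2 × 8 × log2(8) = 48 switches.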

Fig 1.13 Omega Network


1.5 SYMMETRIC SHARED MEMORY ARCHITECTURES


• Consists of several processors with a single physical memory shared by all processors through a shared bus, as shown below
• Also called a centralized shared-memory architecture, since it incorporates a single main memory that has a symmetric relationship to all processors
• The access time from any processor to main memory is uniform
• Because these systems have uniform latency to memory, they are called UMA systems

Fig 1.14 Basic Structure of Centralized Shared memory multiprocessor

Small-scale shared-memory machines usually support the caching of both shared and
private data. Private data is used by a single processor, while shared data is used by
multiple processors.
1.5.1 Cache Coherence in Multiprocessors
• The problem of two different processors having two different values for the same memory location is known as the cache coherence problem.


We initially assume that neither cache contains the variable and that X has the value 1 in memory. Suppose processors A and B both read X into their caches, and afterwards A writes a new value to X. Now A's cache and the memory both contain the new value, but B's cache does not, and if B reads X it will still receive the old value 1.
Informally, we could say that a memory system is coherent if any read of a data item returns the most recently written value of that data item. This simple definition contains two different aspects of memory system behavior:
The first aspect, called coherence, defines what values can be returned by a read. The second aspect, called consistency, determines when a written value will be returned by a read.

A memory system is coherent if


• A read by a processor, P, to a location X that follows a write by P to X, with no
writes of X by another processor occurring between the write and the read by P,
always returns the value written by P.
• A read by a processor to location X that follows a write by another processor to X
returns the written value if the read and write are sufficiently separated in time and
no other writes to X occur between the two accesses.
• Writes to the same location are serialized: that is, two writes to the same location
by any two processors are seen in the same order by all processors. For example, if
the values 1 and then 2 are written to a location, processors can never read the
value of the location as 2 and then later read it as 1.


Coherence defines the behavior of reads and writes to the same memory location, while
consistency defines the behavior of reads and writes with respect to accesses to other
memory locations.

1.5.2 Basic Schemes for Enforcing Coherence


Coherent caches provide migration, since a data item can be moved to a local cache and
used there in a transparent fashion. Coherent caches also provide replication for shared
data that is being simultaneously read, since the caches make a copy of the data item in
the local cache. Replication reduces both latency of access and contention for a read
shared data item.
The protocols to maintain coherence for multiple processors are called cache-coherence
protocols. There are two classes of protocols:
• Directory based—The sharing status of a block of physical memory is kept in just
one location, called the directory;
• Snooping—Every cache that has a copy of the data from a block of physical
memory also has a copy of the sharing status of the block, and no centralized state
is kept. The caches are usually on a shared-memory bus, and all cache controllers
monitor or snoop on the bus to determine whether or not they have a copy of a
block that is requested on the bus.

1.5.3 Snooping Protocols

There are two main types of snooping protocol:

Write-invalidate

• The processor that is writing data causes copies in the caches of all other processors in the system to be invalidated before it changes its local copy. The local processor does this by sending an invalidation signal over the bus, which causes all of the other caches to check whether they hold a copy of the invalidated block. Once the cached copies have been invalidated, the data on the local processor can be updated until another processor requests it.

Write-update

• The processor that is writing the data broadcasts the new data over the bus (without issuing an invalidation signal). All caches that contain copies of the data are then updated. This scheme differs from write-invalidate in that it does not reduce the data to a single local copy on a write; every cached copy is kept up to date.

If two processors do attempt to write the same data simultaneously, one of them wins the
race, causing the other processor’s copy to be invalidated. For the other processor to
complete its write, it must obtain a new copy of the data, which must now contain the
updated value. Therefore, this protocol enforces write serialization.

Fig 1.15 An example of an invalidation protocol working on a snooping bus for a


single cache block(X) with write back caches.

The alternative to an invalidate protocol is to update all the cached copies of a data item when that item is written. This type of protocol is called a write update or write broadcast protocol.


Fig 1.16 An example of a write update or broadcast protocol working on a snooping


bus for a single cache block(X) with write-back caches.

1.5.4. Basic Implementation Techniques


The serialization of access enforced by the bus also forces serialization of writes, since
when two processors compete to write to the same location, one must obtain bus access
before the other. The first processor to obtain bus access will cause the other processor’s
copy to be invalidated, causing writes to be strictly serialized. One implication of this
scheme is that a write to a shared data item cannot complete until it obtains bus access.
• In a write-back cache, newly written data is placed in the cache but not immediately written to memory.
• A write-back cache works in contrast to a write-through cache, which writes to the cache and to memory simultaneously.
• Write-back caches can use the same snooping scheme both for cache misses and for writes:
• Each processor snoops every address placed on the bus. If a processor finds that it
has a dirty copy of the requested cache block, it provides that cache block in
response to the read request and causes the memory access to be aborted.


Since write-back caches generate lower requirements for memory bandwidth, they are
greatly preferable in a multiprocessor, despite the slight increase in complexity.
Therefore, we focus on implementation with write-back caches.
Disadvantages:
• Compiler mechanisms for transparent software cache coherence are very limited.
• Without cache coherence, the multiprocessor loses the advantage of being able to fetch and use multiple words, such as a cache block, while keeping the fetched data coherent.

1.6 DISTRIBUTED SHARED-MEMORY ARCHITECTURES


• The processors use multiple physically distributed memories. By distributing
memory among the processors, the large processor counts can be handled with
reduced bandwidth demands and without long access latency.
• Since the latency of accessing data differs across processors, this organization is called Non-Uniform Memory Access (NUMA).
• Distributed shared memory (DSM) is a form of memory architecture where the (physically separate) memories can be addressed as one (logically shared) address space. Here, the term "shared" does not mean that there is a single centralized memory; "shared" means that the address space is shared.
• The coherence protocol used in such systems is known as a directory protocol. A directory keeps the state of every block that may be cached. Information in the directory includes which caches have copies of the block, whether it is dirty, and so on.


Fig 1.17 A directory is added to each node to implement cache coherence in a


distributed memory multiprocessor

1.6.1 Directory-Based Cache-Coherence Protocols: The Basics


There are two primary operations that a directory protocol must implement:
Handling a read miss and handling a write to a shared, clean cache block.
(Handling a write miss to a shared block is a simple combination of these two.)
To implement these operations, a directory must track the state of each cache block.
Three states
• Shared—One or more processors have the block cached, and the value in memory (as well as in all the caches) is up to date
• Uncached—No processor has a copy of the cache block, i.e. the block is not cached by any processor


• Exclusive—Exactly one processor has a copy of the cache block and it has written
the block, so the memory copy is out of date. The processor is called the owner of
the block.
In addition to tracking the state of each cache block, we must track the processors that
have copies of the block when it is shared, since they will need to be invalidated on a
write.
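A hypothetical sketch of such a directory entry in C (the structure and field names are illustrative assumptions, not a prescribed layout):

/* One directory entry per memory block, kept at the block's home node. */
enum dir_state { UNCACHED, SHARED, EXCLUSIVE };

struct dir_entry {
    enum dir_state state;   /* Uncached / Shared / Exclusive          */
    unsigned long  sharers; /* bit vector: which nodes hold a copy    */
    int            owner;   /* meaningful only in the Exclusive state */
};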

Typically, three nodes are involved in a directory transaction:


• The local node is the node where a request originates.


• The home node is the node where the memory location and the directory entry of
an address reside.
• A remote node is the node that has a copy of a cache block, whether exclusive or
shared. A remote node may be the same as either the local node or the home node.
The figure below shows a catalog of the message types that may be sent among the nodes, between the processors and the directories.

1.7 SYNCHRONIZATION AND VARIOUS HARDWARE PRIMITIVES


1.7.1 Synchronization
Synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions. Efficient spin locks can be built using a simple hardware synchronization instruction and the coherence mechanism.

1.7.2. Basic Hardware Primitives


• The key ability we require to implement synchronization in a multiprocessor is a
set of hardware primitives with the ability to atomically read and modify a
memory location.
• Without such a capability, the cost of building basic synchronization primitives
will be too high and will increase as the processor count increases.
Operations for building synchronization:
• Atomic exchange, which interchanges a value in a register for a value in memory. It can be used to build a simple lock, where the value 0 indicates that the lock is free and 1 indicates that the lock is unavailable.
• Test-and-set, which tests a value and sets it if the value passes the test. For example, we could define an operation that tests for 0 and sets the value to 1.
• Fetch-and-increment, which returns the value of a memory location and atomically increments it.


1.7.3 Implementing Locks Using Coherence


• spin locks: locks that a processor continuously tries to acquire, spinning around a
loop until it succeeds. Spin locks are used when a programmer expects the lock to
be held for a very short amount of time and when she wants the process of locking
to be low latency when the lock is available.
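A minimal sketch of such a spin lock in C, assuming the GCC/Clang atomic builtins (the builtins are an assumption; the notes only name the test-and-set idea). The convention matches the one above: 0 means free, 1 means held.

typedef volatile int spinlock_t;

void lock(spinlock_t *l)   { while (__sync_lock_test_and_set(l, 1)) ; /* spin until 0 is seen */ }
void unlock(spinlock_t *l) { __sync_lock_release(l); /* set the lock back to 0 */ }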
1.7.4 Synchronization Performance Challenges
Barrier Synchronization
• A barrier forces all processes to wait until all the processes reach the barrier and
then releases all of the processes. A typical implementation of a barrier can be
done with two spin locks: one used to protect a counter that tallies the processes
arriving at the barrier and one used to hold the processes until the last process
arrives at the barrier.
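A related sketch of a counter barrier in C, using an atomic counter and sense reversal instead of the two-lock scheme described above (the GCC builtins, the fixed thread count and the memory-ordering details are assumptions):

#define NTHREADS 4

static volatile int count = NTHREADS;   /* threads still to arrive        */
static volatile int sense = 0;          /* flips once per barrier episode */
static __thread int local_sense = 0;    /* per-thread copy of the sense   */

void barrier(void) {
    local_sense = !local_sense;                 /* new episode for this thread */
    if (__sync_sub_and_fetch(&count, 1) == 0) { /* last thread to arrive       */
        count = NTHREADS;                       /* reset for the next episode  */
        sense = local_sense;                    /* release the waiting threads */
    } else {
        while (sense != local_sense) ;          /* spin until released         */
    }
}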
1.7.5 Hardware Primitives
In this section we look at two hardware synchronization primitives. The first primitive
deals with locks, while the second is useful for barriers and a number of other user-level
operations that require counting or supplying distinct indices. In both cases we can create
a hardware primitive where latency is essentially identical to our earlier version, but with
much less serialization, leading to better scaling when there is contention.

1.7.6 Multithreading: Exploiting Thread-Level Parallelism within a Processor


Multithreading allows multiple threads to share the functional units of a single processor
in an overlapping fashion. To permit this sharing, the processor must duplicate the
independent state of each thread. For example, a separate copy of the register file, a


separate PC, and a separate page table are required for each thread. There are two main
approaches to multithreading.
• Fine-grained multithreading switches between threads on each instruction, causing the execution of multiple threads to be interleaved. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that time.
• Coarse-grained multithreading was invented as an alternative to fine grained
multithreading. Coarse-grained multithreading switches threads only on costly
stalls, such as level two cache misses.

1.8 CACHE COHERENCE


In a multiprocessor system where many processors need a copy of the same memory block, maintaining consistency among these copies raises a problem referred to as the Cache Coherence Problem.

This occurs mainly due to the following causes:


• Sharing of writable data.
• Process migration.
• Inconsistency due to I/O
There are two main approaches to ensuring cache coherence:
1. Snooping cache coherence
2. Directory-based cache coherence

1. Snooping cache coherence


• The idea behind snooping comes from bus-based systems:
• When the cores share a bus, any signal transmitted on the bus can be “seen” by all
the cores connected to the bus.


• Thus, when core 0 updates the copy of x stored in its cache, if it also broadcasts
this information across the bus, and if core 1 is “snooping” the bus, it will see that
x has been updated and it can mark its copy of x as invalid. This is more or less
how snooping cache coherence works.
• The principal difference is that the broadcast only informs the other cores that the
cache line containing x has been updated, not that x has been updated.
• Snooping works with both write-through and write-back caches

2. Directory-based cache coherence


• Snooping cache coherence isn’t scalable, because for larger systems it will cause
performance to degrade.
• Directory-based cache coherence protocols attempt to solve this problem
through the use of a data structure called a directory. The directory stores the
status of each cache line. Typically, this data structure is distributed.
• Example: Each core/memory pair might be responsible for storing the part of the
structure that specifies the status of the cache lines in its local memory. Thus,
when a line is read into, say, core 0’s cache, the directory entry corresponding to
that line would be updated indicating that core 0 has a copy of the line.
• When a variable is updated, the directory is consulted, and the copies of that variable's cache line held in other cores' caches are invalidated. Clearly there is substantial additional storage required for the directory, but when a cached variable is updated, only the cores storing that variable's line need to be contacted.
• False sharing is a common problem in shared-memory parallel processing. It occurs when two or more cores repeatedly update different variables that happen to lie on the same cache line: each update invalidates the line in the other cores' caches, even though the cores never access the same data.
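A minimal pthreads sketch of false sharing (the library, thread count and iteration count are assumptions used only for illustration): the two counters are distinct variables, yet they normally sit on the same cache line, so each core's writes keep invalidating the other core's copy of that line.

#include <pthread.h>

long counters[2];                          /* adjacent, so usually one cache line */

void *work(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < 100000000L; i++)
        counters[idx]++;                   /* no true sharing, but still slow     */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, work, (void *)0);
    pthread_create(&t1, NULL, work, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}

Padding or aligning each counter to its own cache line is the usual remedy.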


1.9 PERFORMANCE ISSUES


Parallel programs are used to improve the performance and efficiency of a system. The
following are some of the performance aspects:
1.9.1 Speedup and efficiency
Speedup is defined as the improvement in performance obtained by increasing the speed of execution of a task.
Linear speedup of Parallel programs
Consider a parallel architecture with n processors among which the work of a task is shared equally. In this scenario, the parallel program will run n times faster than the serial program. Let TS be the serial run time and TP the parallel run time. Then TP can be calculated as
TP = TS/n
This is called linear speedup of a parallel program.
Speed up of Parallel programs
Distributed-memory programs transmit data across the network, which is much slower than local memory access, and the more processes or threads there are, the larger this overhead becomes.
Taking these challenges into consideration, the speedup of parallelization (SP) is defined as the ratio between the serial run time (TS) and the parallel run time (TP). Therefore speedup (SP) can be calculated as
SP = TS/TP
Parallelization also introduces additional overheads such as task distribution, critical-section execution and coordination. If the time taken by these overheads is TO, then
TP = TS/n + TO


Efficiency of Parallel programs


Efficiency of Parallel programs is defined as the ratio between speedup of parallel
program (SP) and the number of processors (n) involved. Therefore EP can be computed
as
EP = SP/n
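As a quick illustration with made-up numbers: if TS = 100 s and TP = 25 s on n = 8 cores, then SP = 100/25 = 4 and EP = 4/8 = 0.5.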

2. Amdahl's law
Amdahl's law states that the potential speedup gained by the parallel execution of a program is limited by the portion of the program that cannot be parallelized. Let Fractionparallelized be the fraction of the task that can be parallelized and Fractionunparallelized the fraction that cannot. Then TP can be calculated as
TP = Fractionparallelized × TS/n + Fractionunparallelized × TS
and the speedup can be calculated as
SP = TS/TP = 1 / (Fractionparallelized/n + Fractionunparallelized)
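A small C sketch of these formulas (the function name and the sample values are illustrative only):

#include <stdio.h>

/* SP = 1 / (Fractionparallelized/n + Fractionunparallelized) */
double amdahl_speedup(double frac_par, int n) {
    return 1.0 / (frac_par / n + (1.0 - frac_par));
}

int main(void) {
    /* 90% of the work parallelizable on 8 cores gives SP of about 4.7;
       even with unlimited cores SP can never exceed 1/0.1 = 10. */
    printf("SP = %.2f\n", amdahl_speedup(0.9, 8));
    return 0;
}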

3. Scalability
Scalability is the ability of a system or a network to handle a growing amount of work. The efficiency EP of a parallel program was defined above for a fixed number of cores and a fixed problem size. If we can increase the number of processes or threads and find a corresponding rate of increase in the problem size so that the efficiency stays at EP, the program is said to be scalable.
Let 's' be the size of the problem, and suppose TS = s and TP = s/n + 1. Then
EP = TS / (n × TP) = s / (n × (s/n + 1)) = s / (s + n)
Now let the number of processes/threads be m × n and the problem size be s' × s; s' can then be estimated by solving the equation
(s' × s) / (s' × s + m × n) = s / (s + n)


Let us assume s' = m; then we get
(m × s) / (m × s + m × n) = s / (s + n) = EP
From the above equations it is clear that if the problem size is increased by the same factor as the number of processes or threads, the same efficiency is achieved, and hence the program is scalable.
In technical terms, such a program is called weakly scalable. If the same efficiency can be maintained without increasing the problem size, the program is called strongly scalable.

4. Time Factor
Depending on the API, many approaches can be used to estimate TS and TP. To make the estimation easier and simpler, let us consider the following assumptions (a small timing sketch follows the list):
• The time that elapses between the start of the program and the end of the program
is not considered.
• CPU time is not considered.
• The wall clock time will be considered. And hence there are chance for variations
in timings. Hence instead of taking mean or median time, the minimum time may
be considered.
• In general, a processor will not run more than one thread.
• The time spent for IO operations will not be considered.
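A minimal wall-clock timing sketch in C (clock_gettime, the run count and the do_parallel_work() workload are assumptions used only for illustration):

#include <time.h>

/* Returns the current wall-clock time in seconds. */
double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Usage sketch: time the parallel section several times, keep the minimum.
   double best = 1e30;
   for (int run = 0; run < 5; run++) {
       double t0 = wall_seconds();
       do_parallel_work();                       // hypothetical workload
       double t1 = wall_seconds();
       if (t1 - t0 < best) best = t1 - t0;       // report the minimum time
   }
*/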


1.10 PARALLEL PROGRAM DESIGN


In order to improve the efficiency and performance of a system, serial programs should be parallelized. This section deals with the challenges in parallel program design and the steps for turning a given serial program into a parallel one.
Steps for building Parallel programs
Ian Foster provides an outline of steps
1. Partitioning. Divide the computation to be performed and the data operated on by the
computation into small tasks. The focus here should be on identifying tasks that can be
executed in parallel.
2. Communication. Determine what communication needs to be carried out among the
tasks identified in the previous step.
3. Agglomeration or aggregation. Combine tasks and communications identified in the
first step into larger tasks. For example, if task A must be executed before task B can be
executed, it may make sense to aggregate them into a single composite task.
4. Mapping. Assign the composite tasks identified in the previous step to
processes/threads. This should be done so that communication is minimized, and each
process/thread gets roughly the same amount of work.
This is sometimes called Foster’s methodology.
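A minimal sketch of the four steps applied to summing an array, assuming OpenMP (the API, the function name and the scheduling choice are illustrative assumptions):

/* 1. Partitioning : each addition a[i] is a small task.
   2. Communication: the partial sums must be combined into one global sum.
   3. Agglomeration: the additions are grouped into one contiguous block per thread.
   4. Mapping      : schedule(static) assigns each thread one block of roughly
                     equal size; reduction(+:sum) performs the combining step. */
#include <omp.h>

double parallel_sum(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}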
