ACA UNIT-5 Notes
Parallel computing architectures break the job into discrete parts that can be executed
concurrently. Each part is further broken down into a series of instructions, and
instructions from each part execute simultaneously on different CPUs. Parallel systems
deal with the simultaneous use of multiple computer resources, which can include a single
computer with multiple processors, a number of computers connected by a network to
form a parallel processing cluster, or a combination of both.
Multiprocessor: A computer system with at least two processors.
Job-level parallelism (or process-level parallelism): Utilizing multiple
processors by running independent programs simultaneously.
Parallel processing program: A single program that runs on multiple processors
simultaneously.
Multicore microprocessor: A microprocessor containing multiple
processors ("cores") in a single integrated circuit. Parallel systems are more difficult
to program than computers with a single processor because the architecture of parallel
computers varies widely and the processes running on multiple CPUs must be
coordinated and synchronized. The CPUs are the crux of parallel processing.
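Since a parallel processing program must coordinate and synchronize work across processors, the following is a minimal sketch in C with POSIX threads; the array size, thread count, and names are illustrative assumptions, not part of the notes.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define NTHREADS 4

    static int data[N];
    static long long partial[NTHREADS];   /* one slot per thread: no write conflicts */

    /* Each thread sums its own contiguous slice of the array. */
    static void *worker(void *arg) {
        int id = *(int *)arg;
        int lo = id * (N / NTHREADS);
        int hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
        long long sum = 0;
        for (int i = lo; i < hi; i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        int ids[NTHREADS];

        for (int i = 0; i < N; i++)
            data[i] = 1;                      /* illustrative data */

        for (int i = 0; i < NTHREADS; i++) {  /* one program running on several cores */
            ids[i] = i;
            pthread_create(&tid[i], NULL, worker, &ids[i]);
        }

        long long total = 0;
        for (int i = 0; i < NTHREADS; i++) {  /* synchronization: wait for every part */
            pthread_join(tid[i], NULL);
            total += partial[i];
        }
        printf("total = %lld\n", total);
        return 0;
    }

The explicit join step is the coordination the notes refer to: the result is only combined once every processor has finished its part.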
Parallelism in computer architecture is explained using Flynn's taxonomy. This
classification is based on the number of instruction and data streams used in the
architecture. The machine structure is explained using streams, which are sequences of
items. The four categories in Flynn's taxonomy based on the number of instruction
streams and data streams are the following:
SISD (Single Instruction stream, Single Data stream)
SIMD (Single Instruction stream, Multiple Data streams)
MISD (Multiple Instruction streams, Single Data stream)
MIMD (Multiple Instruction streams, Multiple Data streams)
Centralized Shared-Memory Architectures
The key insight that motivates centralized shared-memory multiprocessors is that the
use of large, multilevel caches can substantially reduce the memory bandwidth
demands of a processor.
These processors were all single-core and often took an entire board, and memory
was located on a shared bus.
With more recent, higher-performance processors, the memory demands have
outstripped the capability of reasonable buses, and recent microprocessors
connect memory directly to a single chip; this connection is sometimes called a
backside or memory bus to distinguish it from the bus used to connect to I/O.
Accessing a chip's local memory, whether for an I/O operation or for an access
from another chip, requires going through the chip that "owns" that memory.
Thus access to memory is asymmetric: faster to the local memory and slower to
the remote memory.
In a multicore, that memory is shared among all the cores on a single chip, but the
asymmetric access from one multicore to the memory of another usually remains.
Symmetric shared-memory machines usually support the caching of both shared
and private data.
Private data are used by a single processor, while shared data are used by multiple
processors, essentially providing communication among the processors through
reads and writes of the shared data.
When a private item is cached, its location is migrated to the cache, reducing the
average access time as well as the memory bandwidth required. Because no other
processor uses the data, the program behavior is identical to that in a uniprocessor.
When shared data are cached, the shared value may be replicated in multiple
caches.
In addition to the reduction in access latency and required memory bandwidth,
this replication also provides a reduction in contention that may exist for shared
data items that are being read by multiple processors simultaneously.
Multiprocessor Cache Coherence
Unfortunately, caching shared data introduces a new problem. Because the view
of memory held by two different processors is through their individual caches, the
processors could end up seeing different values for the same memory location.
This difficulty is generally referred to as the cache coherence problem. Notice
that the coherence problem exists because we have both a global state, defined
primarily by the main memory, and a local state, defined by the individual caches,
which are private to each processor core.
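As a minimal sketch of the situation coherence must handle (the thread names and values here are illustrative, not from the notes): X starts as 0 and both cores have it cached; without coherence, core B could keep returning the stale 0 from its own cache even after core A writes 1.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int X = 0;          /* shared location, cached by both cores */

    static void *core_A(void *arg) {
        (void)arg;
        atomic_store(&X, 1);          /* write lands in A's cache; coherence must */
        return NULL;                  /* make this value visible to B's cache     */
    }

    static void *core_B(void *arg) {
        (void)arg;
        int v;
        do {                          /* with coherent caches this loop terminates:  */
            v = atomic_load(&X);      /* B eventually sees the most recently written */
        } while (v == 0);             /* value; without coherence B could spin on a  */
        printf("B observed X = %d\n", v);   /* stale cached copy indefinitely        */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&b, NULL, core_B, NULL);
        pthread_create(&a, NULL, core_A, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }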
A memory system is coherent if any read of a data item returns the most recently
written value of that data item.
A memory system is coherent if
1. A read by processor P to location X that follows a write by P to X, with no writes
of X by another processor occurring between the write and the read by P, always
returns the value written by P.
2. A read by a processor to location X that follows a write by another processor to X
returns the written value if the read and write are sufficiently separated in time and no
other writes to X occur between the two accesses.
3. Writes to the same location are serialized; that is, two writes to the same location
by any two processors are seen in the same order by all processors. For example, if
the values 1 and then 2 are written to a location, processors can never read the value
of the location as 2 and then later read it as 1.
The first property simply preserves program order—we expect this property to be
true even in uniprocessors.
The second property defines the notion of what it means to have a coherent view
of memory: if a processor could continuously read an old data value, we would
clearly say that memory was incoherent.
The need for write serialization is more subtle, but equally important. Suppose we
did not serialize writes, and processor P1 writes location X followed by P2
writing location X.
Serializing the writes ensures that every processor will see the write done by P2
at some point. If we did not serialize the writes, it might be the case that some
processors could see the write of P2 first and then see the write of P1,
maintaining the value written by P1 indefinitely.
The simplest way to avoid such difficulties is to ensure that all writes to the same
location are seen in the same order; this property is called write serialization.
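The P1/P2 scenario above can be spelled out as code; this is an illustrative sketch of the guarantee, not a test that can fail on coherent hardware.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int X = 0;

    static void *p1(void *arg) { (void)arg; atomic_store(&X, 1); return NULL; }
    static void *p2(void *arg) { (void)arg; atomic_store(&X, 2); return NULL; }

    /* Each reader samples X twice. Write serialization guarantees that all
     * processors agree on a single order of the two writes: if the order is
     * 1-then-2, no reader may observe 2 and later observe 1. */
    static void *reader(void *arg) {
        (void)arg;
        int first = atomic_load(&X);
        int second = atomic_load(&X);
        printf("saw %d then %d\n", first, second);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        pthread_create(&t[0], NULL, p1, NULL);
        pthread_create(&t[1], NULL, p2, NULL);
        pthread_create(&t[2], NULL, reader, NULL);
        pthread_create(&t[3], NULL, reader, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        return 0;
    }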
Although the three properties just described are sufficient to ensure coherence,
the question of when a written value will be seen is also important.
Coherence and consistency are complementary: Coherence defines the behavior
of reads and writes to the same memory location, while consistency defines the
behavior of reads and writes with respect to accesses to other memory locations.
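To make this distinction concrete, here is a commonly used two-location example sketched in C; the names data and flag are illustrative assumptions.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int data = 0;
    static atomic_int flag = 0;

    /* The writer touches two different locations. Coherence constrains the reads
     * and writes of each location individually; whether a reader that sees
     * flag == 1 is also guaranteed to see data == 42 involves the ordering of
     * accesses to different locations, i.e., consistency. */
    static void *writer(void *arg) {
        (void)arg;
        atomic_store(&data, 42);
        atomic_store(&flag, 1);
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        while (atomic_load(&flag) == 0)
            ;                          /* wait for the flag */
        /* With C11's default sequentially consistent atomics this prints 42; under
         * a weaker consistency model without that ordering guarantee, the reader
         * could still observe the old value of data. */
        printf("data = %d\n", atomic_load(&data));
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }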
Basic Schemes for Enforcing Coherence
The coherence problem for multiprocessors and I/O, although similar in origin,
has different characteristics that affect the appropriate solution.
Unlike I/O, where multiple data copies are a rare event—one to be avoided
whenever possible—a program running on multiple processors will normally
have copies of the same data in several caches.
In a coherent multiprocessor, the caches provide both migration and replication of
shared data items.
Coherent caches provide migration because a data item can be moved to a local
cache and used there in a transparent fashion.
This migration reduces both the latency to access a shared data item that is
allocated remotely and the bandwidth demand on the shared memory.
Because the caches make a copy of the data item in the local cache, coherent
caches also provide replication for shared data that are being read simultaneously.
Replication reduces both latency of access and contention for a read shared data
item.
Supporting this migration and replication is critical to performance in accessing
shared data.
Thus, rather than trying to solve the problem by avoiding it in software,
multiprocessors adopt a hardware solution by introducing a protocol to maintain
coherent caches.
The protocols to maintain coherence for multiple processors are called cache
coherence protocols.
Key to implementing a cache coherence protocol is tracking the state of any
sharing of a data block.
The state of any cache block is kept using status bits associated with the block,
similar to the valid and dirty bits kept in a uniprocessor cache.
There are two classes of protocols in use, each of which uses different techniques
to track the sharing status:
Directory based—The sharing status of a particular block of physical memory is
kept in one location, called the directory (a sketch of a possible directory entry
follows this list). There are two very different types of directory-based cache
coherence. In an SMP, we can use one centralized directory, associated with the
memory or some other single serialization point, such as the outermost cache in a
multicore. In a DSM, it makes no sense to have a single directory because that
would create a single point of contention and make it difficult to scale to many
multicore chips given the memory demands of multicores with eight or more cores.
Snooping—Rather than keeping the state of sharing in a single directory, every
cache that has a copy of the data from a block of physical memory could track the
sharing status of the block. In an SMP, the caches are typically all accessible via
some broadcast medium (e.g., a bus connects the per-core caches to the shared
cache or memory), and all cache controllers monitor or snoop on the medium to
determine whether they have a copy of a block that is requested on a bus or
switch access. Snooping can also be used as the coherence protocol for a
multichip multiprocessor, and some designs support a snooping protocol on top
of a directory protocol within each multicore.
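As a minimal sketch of what a directory might record per block of physical memory, assuming a small fixed number of nodes; the struct layout, field names, and node count are illustrative assumptions, not a real protocol implementation.

    #include <stdint.h>

    #define NUM_NODES 8                /* illustrative: one directory bit per node */

    /* Possible sharing states for one block of physical memory. */
    enum dir_state {
        DIR_UNCACHED,                  /* no cache has a copy                      */
        DIR_SHARED,                    /* one or more caches hold a clean copy     */
        DIR_MODIFIED                   /* exactly one cache holds a dirty copy     */
    };

    /* One directory entry: the sharing state plus a bit vector naming which
     * nodes currently hold the block, so invalidations can be sent point to
     * point instead of broadcast. */
    struct dir_entry {
        enum dir_state state;
        uint8_t sharers;               /* bit i set => node i has a cached copy    */
    };

    /* On a write miss from node w, the directory invalidates all other sharers
     * and records the writer as the sole (modified) owner. */
    static void dir_handle_write_miss(struct dir_entry *e, int w) {
        for (int i = 0; i < NUM_NODES; i++) {
            if (i != w && (e->sharers & (1u << i))) {
                /* send_invalidate(i);  -- hypothetical network message */
            }
        }
        e->sharers = (uint8_t)(1u << w);
        e->state = DIR_MODIFIED;
    }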
Snooping protocols became popular with multiprocessors using microprocessors
(single-core) and caches attached to a single shared memory by a bus.
The bus provided a convenient broadcast medium to implement the snooping
protocols.
There are two ways to maintain the coherence requirement described in the prior
section.
One method is to ensure that a processor has exclusive access to a data item
before writing that item.
This style of protocol is called a write invalidate protocol because it invalidates
other copies on a write. It is by far the most common protocol.
Exclusive access ensures that no other readable or writable copies of an item exist
when the write occurs: all other cached copies of the item are invalidated.
To see how an invalidation protocol with write-back caches ensures coherence,
consider a write followed by a read by another processor: because the write
requires exclusive access, any copy held by the reading processor must be
invalidated (thus the protocol name).
Therefore when the read occurs, it misses in the cache and is forced to fetch a
new copy of the data.
For a write, we require that the writing processor has exclusive access, preventing
any other processor from being able to write simultaneously.
If two processors do attempt to write the same data simultaneously, one of them
wins the race (we’ll see how we decide who wins shortly), causing the other
processor’s copy to be invalidated.
For the other processor to complete its write, it must obtain a new copy of the
data, which must now contain the updated value. Therefore this protocol enforces
write serialization.
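A minimal sketch of the bookkeeping behind a write invalidate protocol, using the common three-state (MSI-style) block states; the enum names, core count, and function shapes are illustrative assumptions, not the notes' own protocol.

    /* Per-block state kept by each cache, analogous to the valid/dirty bits of a
     * uniprocessor cache (MSI-style; names are illustrative). */
    enum block_state {
        INVALID,                       /* no usable copy in this cache             */
        SHARED,                        /* clean copy, other caches may have it too */
        MODIFIED                       /* only copy, dirty with respect to memory  */
    };

    #define NUM_CACHES 4               /* illustrative core count */

    /* State of one physical block in every cache. */
    static enum block_state state[NUM_CACHES];

    /* A write by cache w: obtain exclusive access by broadcasting an invalidate
     * on the shared medium; every snooping cache with a copy drops it, so no
     * other readable or writable copy remains. */
    static void processor_write(int w) {
        for (int c = 0; c < NUM_CACHES; c++)
            if (c != w)
                state[c] = INVALID;    /* snoop hit: invalidate the copy           */
        state[w] = MODIFIED;           /* writer now has the only (dirty) copy     */
    }

    /* A read by cache r that misses: fetch a fresh copy (from memory or from the
     * cache holding it MODIFIED, which supplies the data) and mark both SHARED. */
    static void processor_read_miss(int r) {
        for (int c = 0; c < NUM_CACHES; c++)
            if (state[c] == MODIFIED)
                state[c] = SHARED;     /* owner supplies data, keeps a clean copy  */
        state[r] = SHARED;
    }

If two processors attempt to write simultaneously, the shared broadcast medium serializes the two invalidations; whichever invalidate is placed on the medium first wins, and the other processor must obtain a new copy of the block before completing its write.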
Shared Memory vs. Message Passing
1. Shared memory provides a region of memory that is mainly used for data
communication, whereas message passing is mainly used for communication by
exchanging messages.
2. Shared memory is used for communication between processes on the same machine
(single-processor or multiprocessor systems) that share a common address space,
whereas message passing is used in distributed environments where the communicating
processes are present on remote machines connected with the help of a network.
3. With shared memory, the code that reads or writes the shared data must be written
explicitly by the application programmer, whereas with message passing no such code is
required because the message passing facility provides a mechanism for communication
and synchronization of the actions performed by the communicating processes.
4. With shared memory, the programmer must make sure that processes are not writing
to the same location simultaneously, whereas message passing is useful for sharing
small amounts of data, so conflicts need not occur.
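A hedged sketch contrasting the two models in C; the use of a global counter for shared memory and a pipe for message passing are illustrative choices, not part of the notes.

    #include <pthread.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <sys/wait.h>

    /* Shared memory: two threads in one address space communicate by reading and
     * writing the same variable; the programmer must add the synchronization. */
    static int shared_value;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);     /* explicit synchronization by the programmer */
        shared_value = 42;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /* Message passing: two processes with separate address spaces communicate by
     * sending data through a channel (a pipe here); the facility itself provides
     * the synchronization, since the read blocks until the message arrives. */
    static void message_passing_demo(void) {
        int fd[2];
        pipe(fd);
        if (fork() == 0) {             /* child: sender */
            int msg = 42;
            write(fd[1], &msg, sizeof msg);
            _exit(0);
        }
        int received;
        read(fd[0], &received, sizeof received);   /* blocks until data is sent */
        printf("received %d via message passing\n", received);
        wait(NULL);
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);  /* shared memory demo */
        pthread_join(t, NULL);
        printf("shared_value = %d via shared memory\n", shared_value);
        message_passing_demo();
        return 0;
    }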