MCP Unit 1
Scalable design principles – Principles of processor design – Instruction
Level Parallelism, Thread level parallelism. Parallel computer models –
Symmetric and distributed shared memory architectures – Performance
Issues – Multi-core Architectures - Software and hardware multithreading –
SMT and CMP architectures – Design issues – Case studies – Intel Multi-
core architecture – SUN CMP architecture.
Definition:
Instruction-level parallelism (ILP) is the potential overlap of the
execution of instructions, exploited through pipelining, to improve
the performance of the system.
Instruction-level parallelism is limited by several factors, chief
among them the dependences between instructions.
There are three different types of dependences: data
dependences (also called true data dependences), name
dependences, and control dependences.
1. Data Dependence
An instruction j is data dependent on instruction i if either of
the following holds:
a. Instruction i produces a result that may be used by
instruction j, or
b. Instruction j is data dependent on instruction k, and
instruction k is data dependent on instruction i.
The second condition simply states that one instruction is
dependent on another if there exists a chain of dependences of
the first type between the two instructions. This dependence
chain can be as long as the entire program.
E.g.
Loop: L.D    F0, 0(R1)    ; F0 = array element
      ADD.D  F4, F0, F2   ; add scalar in F2
      S.D    F4, 0(R1)    ; store result
      DADDUI R1, R1, #-8  ; decrement pointer by 8 bytes
      BNE    R1, R2, Loop ; branch if R1 != R2
The data dependences here are: ADD.D depends on L.D through F0,
S.D depends on ADD.D through F4, and BNE depends on DADDUI
through R1.
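For reference, this assembly implements a simple C loop that adds a
scalar to every element of an array. The sketch below is a hypothetical
C equivalent; the names x, s, and add_scalar are illustrative, not from
the original code.

/* C equivalent of the loop above (hypothetical array x, scalar s). */
void add_scalar(double *x, double s, int n) {
    for (int i = n; i > 0; i--)   /* DADDUI/BNE implement the loop control */
        x[i] = x[i] + s;          /* L.D, ADD.D, S.D */
}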
2. Name Dependences
The name dependence occurs when two instructions use the
same register or memory location, called a name, but there is
no flow of data between the instructions associated with that
name.
There are two types of name dependences between an
instruction i that precedes instruction j in program order:
a. An antidependence between instruction i and
instruction j occurs when instruction j writes a register or
memory location that instruction i reads. The original
ordering must be preserved to ensure that i reads the
correct value.
b. An output dependence occurs when instruction i and
instruction j write the same register or memory location.
The ordering between the instructions must be preserved
to ensure that the value finally written corresponds to
instruction j.
Name dependences can be eliminated, since no value is actually
transmitted between the instructions:
i. the instructions can execute simultaneously, or
ii. they can be reordered, if the name (register or memory location)
used in the instructions is changed so that the instructions do not
conflict. This is called register renaming (it can be done statically
by a compiler or dynamically by the hardware).
Data Hazards
A data hazard is created whenever there is a dependence between
instructions, and the overlap caused by pipelining, or the reordering
of instructions, would change the order of access to the operand
involved in the dependence.
Data hazards are classified into three types, depending on the order
of read and write accesses in the instructions.
E.g., consider two instructions i and j, with i occurring before j
in program order. The possible data hazards are
RAW (read after write)
j tries to read a source before i writes it, so j incorrectly
gets the old value.
This hazard is the most common and corresponds to a true data
dependence.
To overcome it, program order must be preserved to ensure that
j receives the value produced by i.
E.g.,
instruction i: load r1, a
instruction j: add r2, r1, r1
Due to overlapping, if instruction j tries to read r1 before
instruction i completes, then j reads the stale (old) value of r1.
WAW (write after write)
j tries to write an operand before it is written by i.
This operation leaves the value written by i rather than
the value written by j in the destination.
This hazard corresponds to output dependence.
E.g.,
i: mul r1, r2, r3
j: add r1, r4, r5
Due to overlapping, if instruction j writes r1 before instruction i
does, then r1 ends up holding the multiplication result rather than
the addition result. This can be eliminated through register
renaming, which is done by the compiler:
i: mul r1, r2, r3
j: add r6, r4, r5
WAR (write after read)
j tries to write an operand before i reads it, so i incorrectly
gets the new value.
This hazard corresponds to an antidependence.
E.g.,
i: add r2, r1, r3
j: mul r1, r4, r5
Due to overlapping, if instruction j writes r1 before instruction i
reads it, then i computes with the wrong value; renaming j's
destination (e.g., to r6) removes the hazard.
3. Control dependence
A control dependence determines the ordering of an
instruction, i, with respect to a branch instruction so that the
instruction i is executed in correct program order.
Every instruction is control dependent on some set of
branches
E.g., the dependence of the statements in the “then” part
of an if statement on the branch.
if p1
{ S1;
};
if p2
{
S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2
but not on p1.
In general, there are two constraints imposed by control
dependences:
1. An instruction that is control dependent on a branch
cannot be moved before the branch so that its execution
is no longer controlled by the branch.
o For example, we cannot take an instruction from the
then-portion of an if-statement and move it before the
if-statement.
2. An instruction that is not control dependent on a branch
cannot be moved after the branch so that its execution is
controlled by the branch.
o For example, we cannot take a statement before the
if-statement and move it into the then-portion.
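The following is a minimal C sketch of both constraints; the variables
p1, p2, s1, s2, and t are illustrative, not from the original.

/* Both code-motion constraints in one function. */
int constraints(int p1, int p2, int a, int b, int c, int d) {
    int s1 = 0, s2 = 0, t;
    if (p1) {
        s1 = a + b;   /* control dependent on p1: cannot be hoisted above the if */
    }
    t = c + d;        /* not control dependent on p2: cannot be sunk into the
                         then-part, or it would not execute when p2 is false */
    if (p2) {
        s2 = t * 2;
    }
    return s1 + s2;
}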
Control dependence is preserved by two properties in a
simple pipeline
First, instructions execute in program order. This ordering
ensures that an instruction that occurs before a branch is
executed before the branch.
Second, the detection of control or branch hazards
ensures that an instruction that is control dependent on a
branch is not executed until the branch direction is known.
Computer architectures are classified (Flynn's taxonomy) along two
dimensions:
1. Instruction streams
The number of instruction streams the computer can process at a
single point in time.
2. Data streams
The number of data streams that can be processed at a single point
in time.
[Graphical representation: classification along the instruction-stream
and data-stream dimensions]
3. Multiple Instruction Stream, Single Data Stream (MISD)
These machines are capable of processing a single data
stream using multiple instruction streams simultaneously.
Generally, multiple instructions need multiple data streams, so this
class of parallel computer is used mainly as a theoretical model.
[Pictorial representation of an MISD machine]
MIMD multiprocessors fall into two categories: symmetric
(centralized) shared-memory architectures and distributed-memory
architectures.
Shared-memory Architectures usually support the caching of
both shared and private data.
Private data: data used by a single processor, whose caching behaves
as in a uniprocessor. Shared data: data used by multiple processors;
caching shared data replicates the value in several caches, which
raises the coherence problem discussed below.
Writes to the same location by two processors are seen in the same
order by all processors.
E.g., if the values 1 and then 2 are written to a location,
processors can never read the value of the location as 2 and then
later read it as 1. This property is called write serialization.
Consistency: Consistency defines the behavior of reads and writes
with respect to accesses to other memory locations. It determines
when a written value will be returned by a read.
In a coherent multiprocessor, the caches provide both
migration and replication of shared data items.
Migration: a shared data item can be moved to a local cache and used
there transparently, reducing both the latency of access and the
demand on memory bandwidth.
Replication: when shared data are being read by multiple processors
simultaneously, the caches make a copy of the data item in each local
cache, reducing both access latency and contention for the
read-shared item.
For example, with write-through caches (the values below follow the
standard textbook example):
Time | Event                 | Cache A | Cache B | Memory X
  0  |                       |    -    |    -    |    1
  1  | CPU A reads X         |    1    |    -    |    1
  2  | CPU B reads X         |    1    |    1    |    1
  3  | CPU A stores 0 into X |    0    |    1    |    0
At time 3, CPU A and CPU B hold different values for the same data
item in their caches. This is known as the cache coherence problem,
and it can be avoided by cache coherence protocols.
Write Invalidate Protocol
When one processor writes, it invalidates all copies of the
data that may be in other caches before updating its own copy.
Write Update Protocol
When one processor writes, it broadcasts the new value and
updates any copies that may be in other caches.
Write Update Protocol
This protocol updates all cached copies of a data item when that
item is written. If the data is not shared, there is no need to
broadcast the value or update any other caches.
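As an illustration, here is a minimal C sketch of write-invalidate
behavior for a single memory location. The two-entry cache array, the
write-through memory update, and the function names are assumptions
made for this example, not part of any real protocol implementation.

#include <stdio.h>

#define NCACHES 2

enum state { INVALID, VALID };

struct cache { enum state st; int value; };

static struct cache caches[NCACHES];  /* zero-initialized: both INVALID */
static int memory = 0;

/* Write-invalidate: the writer invalidates every other cached copy,
 * then updates its own copy (write-through to memory for simplicity). */
void cpu_write(int cpu, int value) {
    for (int i = 0; i < NCACHES; i++)
        if (i != cpu) caches[i].st = INVALID;  /* snooped invalidate */
    caches[cpu].st = VALID;
    caches[cpu].value = value;
    memory = value;
}

int cpu_read(int cpu) {
    if (caches[cpu].st == INVALID) {  /* miss: fetch the current value */
        caches[cpu].value = memory;
        caches[cpu].st = VALID;
    }
    return caches[cpu].value;
}

int main(void) {
    cpu_read(0); cpu_read(1);   /* both caches now hold 0 */
    cpu_write(0, 42);           /* CPU 0 writes; CPU 1's copy is invalidated */
    printf("CPU 1 reads %d\n", cpu_read(1));  /* misses and re-fetches 42 */
    return 0;
}

Because CPU 1's copy was invalidated, its next read misses and fetches
the up-to-date value, which is how the protocol preserves coherence.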
2. With multiword cache blocks, each word written in a cache block
requires a write broadcast in an update protocol, whereas only the
first write to any word in the block needs to generate an invalidate
in an invalidation protocol. An invalidation protocol works on cache
blocks, while an update protocol must work on individual words.
In a write-through cache, it is easy to find the most recent value
of a data item, since all written data are always sent to the
memory, from which the most recent value can always be fetched.
[Figure: merged state transition diagram for the snooping protocol]
DISTRIBUTED SHARED MEMORY ARCHITECTURES
A directory is added to each node to implement cache
coherence in a distributed memory multiprocessor. Each
directory is responsible for tracking the caches that share
the memory address of the portion of memory in the node.
States: each directory entry records whether the block is uncached,
shared, or exclusive.
Nodes in DSM:
The local node is the node where a request originates.
The home node is the node where the memory location
and the directory entry of an address reside.
A remote node is the node that has a copy of a cache
block, whether exclusive (in which case it is the only copy)
or shared. A remote node may be the same as either the
local node or the home node
An Example Directory Protocol
Figure shows the actions taken at the directory in response to
messages received.
Uncached State: the copy in memory is the current value of the block.
On a read miss, the requesting processor is sent the data from memory
and becomes the only sharing node, and the block is made shared. On a
write miss, the requesting processor is sent the value and becomes the
sharing (owner) node; the block is made exclusive to indicate that the
only valid copy is cached, and Sharers indicates the identity of the
owner.
Shared State: the memory value is up to date. A read miss is
satisfied from memory, and the requesting node is added to the
Sharers set; on a write miss, invalidates are sent to all nodes in
Sharers and the block becomes exclusive.
Exclusive State:
When the block is in the exclusive state, the current value of the
block is held in the cache of the processor identified by the Sharers
set (the owner), so there are three possible directory requests: a
read miss, a data write-back, and a write miss.
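A minimal C sketch of these directory transitions follows, assuming a
single block and a bit-vector Sharers set; the actual fetch and
invalidate messages are elided, and all names are illustrative.

#include <stdio.h>

enum dstate { UNCACHED, SHARED, EXCLUSIVE };

struct dir_entry {
    enum dstate state;
    unsigned sharers;   /* bit i set => node i holds a copy */
};

/* Read miss from a node: supply the data and add the node to Sharers. */
void read_miss(struct dir_entry *d, int node) {
    if (d->state == EXCLUSIVE) {
        /* the owner would be asked to write the block back (message elided) */
    }
    d->sharers |= 1u << node;
    d->state = SHARED;
}

/* Write miss from a node: invalidate all sharers, make the node owner. */
void write_miss(struct dir_entry *d, int node) {
    /* invalidates would be sent to every other node in Sharers (elided) */
    d->sharers = 1u << node;
    d->state = EXCLUSIVE;
}

int main(void) {
    struct dir_entry d = { UNCACHED, 0 };
    read_miss(&d, 1);    /* node 1 reads: SHARED, Sharers = {1} */
    read_miss(&d, 2);    /* node 2 reads: SHARED, Sharers = {1,2} */
    write_miss(&d, 2);   /* node 2 writes: EXCLUSIVE, owner = node 2 */
    printf("state=%d sharers=0x%x\n", d.state, d.sharers);
    return 0;
}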
PERFORMANCE ISSUES
Result: since x1 had been read by P2, the write to x1 invalidated
the copy in P2's cache.
Commercial Workload
Multiprogramming and OS Workload
Performance Measurements of the Commercial Workload
AV (AltaVista, an application that performs web search)
Cache Miss Types
True sharing – a miss that arises from the communication of data
through the cache coherence mechanism; for example, a write
invalidates a block in another cache, and that processor later reads
the word that was written.
False sharing – occurs when a block is invalidated (causing a later
miss) because some word in the block, other than the one being read,
is written to.
Capacity – occurs when the cache cannot contain all the blocks
needed, so blocks are discarded and must later be retrieved.
Conflict – occurs due to the block-placement (mapping) strategy, when
too many blocks map to the same set.
Cold (or) Compulsory Miss – the first-time access to a block.
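False sharing in particular is easy to provoke in software. Below is a
minimal pthreads sketch; the 64-byte line size, the iteration count,
and the padding trick are assumptions for illustration. With the pad
field removed, the two counters share one cache line, and every
increment on one core invalidates the other core's copy even though no
data is actually shared.

#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L
#define LINE 64

/* Each counter is padded to its own presumed 64-byte cache line. */
static struct { long count; char pad[LINE - sizeof(long)]; } counters[2];

static void *worker(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].count++;   /* each thread touches only its own counter */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = { 0, 1 };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].count, counters[1].count);
    return 0;
}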
Increase in Cache Size
Increase in Processor Count
Performance Measurements of the Multiprogramming
and OS Workload
The measured memory hierarchy: Level 2 cache, main memory, and the
disk system.
Performance:
Increasing the data cache from 32 KB to 256 KB causes the user miss
rate to decrease proportionately more than the kernel miss rate.
There are two main approaches: fine-grained multithreading and
coarse-grained multithreading.
Fine-grained multithreading
Switches between threads on each instruction, so the execution of
multiple threads is interleaved, usually in a round-robin fashion,
skipping any thread that is stalled at that time.
Advantage: it can hide the throughput losses that arise from both
short and long stalls, since instructions from other threads are
executed when one thread stalls.
Disadvantage: it slows down the execution of an individual thread,
since a thread that is ready to execute without stalls is delayed by
instructions from other threads.
Coarse-grained multithreading
Switches threads only on costly stalls, such as level-2 cache misses.
Disadvantage:
It is limited in its ability to overcome throughput losses,
especially from shorter stalls.
There are pipeline start-up costs in coarse-grained multithreading.
A CPU with coarse-grained multithreading issues instructions from
a single thread; when a stall occurs, the pipeline must be emptied
or frozen, and the new thread that begins executing after the stall
must fill the pipeline before instructions are able to complete.
Because of this start-up overhead, coarse-grained
multithreading is much more useful for reducing the
penalty of high cost stalls, where pipeline refill is
negligible compared to the stall time.
Simultaneous Multithreading:
Simultaneous multithreading (SMT) uses the resources of a
multiple-issue, dynamically scheduled processor to exploit
thread-level and instruction-level parallelism at the same time,
issuing instructions from multiple threads in a single clock cycle.
The figure shows the following
1. a superscalar without multithreading
-limited by a lack of ILP
2. a superscalar with coarse-grained multithreading
- the long stalls are partially hidden by switching to
another thread.
3. a superscalar with fine-grained multithreading
- only one thread issues instructions in a given clock
cycle
4. a superscalar with simultaneous multithreading (SMT)
- In the SMT case, thread-level parallelism (TLP) and
instruction-level parallelism (ILP) are exploited simultaneously,
with multiple threads using the issue slots in a single clock cycle.
Design Challenges in SMT processors
SOFTWARE MULTITHREADING
Because thread management is done by the operating
system, kernel threads are generally slower to create and
manage than are user threads.
Most operating systems, including Windows NT, Windows 2000,
Solaris 2, BeOS, and Tru64 UNIX (formerly Digital UNIX), support
kernel threads.
MULTI-THREADING MODELS:
There are three models for thread libraries, each with its own
trade-offs
MANY-TO-ONE:
Many user-level threads are mapped to a single kernel thread, with
thread management done in user space by the thread library.
Advantages:
1. Totally portable
2. More efficient, since thread management requires no kernel
intervention
Disadvantages:
1. The entire process blocks if one thread makes a blocking system
call
2. Threads cannot run in parallel on multiple processors, since only
one thread can access the kernel at a time
ONE-TO-ONE:
Each user thread is mapped to its own kernel thread (see the pthreads
sketch after this list).
Advantages:
1. Another thread can run when one thread makes a blocking system
call
2. Allows threads to run in parallel on multiprocessors
Disadvantages:
1. Creating a user thread requires creating the corresponding kernel
thread, which adds overhead
MANY-TO-MANY:
Many user-level threads are multiplexed onto a smaller or equal
number of kernel threads.
Advantages:
1. Combines the efficiency of many-to-one with the concurrency of
one-to-one
2. Allows parallelism
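As a concrete illustration, here is a minimal pthreads sketch; on
Linux, the NPTL library implements the one-to-one model, so each
pthread_create below produces a kernel thread. The thread count and
the printed message are illustrative.

#include <pthread.h>
#include <stdio.h>

static void *hello(void *arg) {
    printf("thread %ld running\n", (long)arg);  /* id passed via arg */
    return NULL;
}

int main(void) {
    pthread_t tid[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, hello, (void *)i);  /* one kernel thread each */
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);  /* wait for all threads to finish */
    return 0;
}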
HARDWARE MULTITHREADING
Multiprocessors
A multiprocessor system has its processors residing in separate
chips and processors are interconnected by a backplane bus.
Multi-core processors
Chip-level multiprocessing (CMP, or multicore) integrates two or
more independent cores (normally CPUs) into a single package
composed of a single integrated circuit (IC), called a die.
Processors may share an on-chip cache
or each can have its own cache
Examples: HP Mako, IBM Power4
[Pictorial representation: a multi-core processor]
Example: IBM Cell Broadband Engine
A heterogeneous model could have a large centralized core built for
generic processing and running an OS, a core for graphics, a
communications core, an enhanced mathematics core, an audio core, a
cryptographic core, and so on.
advantage
the cores are tailored so that each core can handle a different kind
of task efficiently
disadvantage
it is difficult to manufacture this type of IC
special training is required to operate these cores
Chip Multithreading
Chip Multithreading = Chip Multiprocessing + Hardware
Multithreading
CMT is achieved by multiple cores on a single chip or multiple
threads on a single core.
CMT processors are especially suited to server workloads, which
generally have high levels of Thread-Level Parallelism (TLP).
Design Issues / Challenges faced by CMP
Challenges are
1. Power and temperature management
2. Memory/Cache coherence
3. Multithreading
1. Power and Temperature Management
If two cores are simply placed on a single chip, the chip consumes
twice the power and generates a large amount of heat.
If the processor overheats, the computer may fail or even burn out.
To combat unnecessary power consumption, designers incorporate a
power-control unit that can shut down unused cores or limit the
amount of power they draw. Powering off unused cores also reduces
the leakage current in the chip.
To lessen the heat generated by multiple cores, the number of hot
spots is limited and the heat is spread out across the chip; a
temperature-monitoring unit is designed for this purpose.
2. Memory/Cache Coherence
Refer to the cache coherence problem and the snooping protocols
discussed earlier.
3. Multithreading
Using a multicore processor to its full potential is achieved by
rebuilding applications so that their work is divided into threads
that can run on different cores, as in the sketch below.
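A minimal C sketch of such a rebuild, splitting a summation loop
across worker threads; the array size, the thread count, and all
names are illustrative assumptions.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4            /* e.g., one thread per core */

static int data[N];
static long partial[NTHREADS];

static void *sum_chunk(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    long sum = 0;             /* accumulate locally, write shared slot once */
    for (long i = lo; i < hi; i++)
        sum += data[i];       /* each thread sums only its own chunk */
    partial[id] = sum;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1;
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, sum_chunk, (void *)i);
    long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];  /* combine the per-thread results */
    }
    printf("total = %ld\n", total);  /* prints total = 1000000 */
    return 0;
}

Accumulating in a local variable and writing the shared partial[]
slot only once also sidesteps the false-sharing pitfall described
earlier, since adjacent slots of partial[] may occupy one cache line.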
6. Point-to-point interconnect
It is a platform architecture that provides high-speed connections
between microprocessors and external memory, and between
microprocessors and the I/O hub.
The advantage is that it is point-to-point: there is no single bus
that all the processors must use and contend for to reach memory
and I/O. This improves scalability and eliminates the competition
between processors for bus bandwidth.
7. Intel virtualization technology
Intel Virtualization Technology (Intel VT) is a set of
hardware enhancements to Intel server and client
platforms that provide software-based virtualization
solutions. Intel VT allows a platform to run multiple
operating systems and applications in independent
partitions, allowing one computer system to function as multiple
virtual systems.
8. Fully buffered DIMM (dual in-line memory module)
is a memory technology that can be used to increase
reliability and density of memory systems.
In conventional designs, the data lines from the memory controller
have to be connected to the data lines in every DRAM module via
multidrop buses. As memory width, as well as access speed,
increases, the signal degrades at the interface of the bus and the
device. This limits the speed and/or the memory density. FB-DIMMs
take a different approach to solve this problem: a buffer on each
module sits between the memory controller and the DRAM chips, and
the controller talks to the buffer over a serial point-to-point
link rather than driving every DRAM directly.