MCP Unit 1

UNIT I INTRODUCTION TO MULTIPROCESSORS AND SCALABILITY ISSUES

Scalable design principles – Principles of processor design – Instruction
Level Parallelism, Thread level parallelism. Parallel computer models –
Symmetric and distributed shared memory architectures – Performance
Issues – Multi-core Architectures – Software and hardware multithreading –
SMT and CMP architectures – Design issues – Case studies – Intel Multi-core
architecture – SUN CMP architecture.

Book to Refer – 1. John L. Hennessy and David A. Patterson, “Computer
Architecture – A Quantitative Approach”, Morgan Kaufmann/Elsevier
Publishers, 4th edition, 2007.

INSTRUCTION LEVEL PARALLELISM

1. What is a parallel instruction?


 Parallel instructions are a set of instructions that do not
depend on each other to be executed.
 The hierarchy of parallelism is
a. Bit level Parallelism: e.g., a 16-bit add on an 8-bit processor
b. Instruction level Parallelism
c. Loop level Parallelism (a threaded sketch follows this list)
for (i=1; i<=1000; i= i+1)
x[i] = x[i] + y[i];
d. Thread level Parallelism (SMT, multi-core computers)
e. Processor Level Parallelism
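
The loop shown under (c) has no dependences between iterations, so its
loop-level parallelism can be turned into thread-level parallelism by
splitting the iterations across threads. A minimal sketch in C with POSIX
threads (the array size, thread count and helper names are assumptions for
illustration, not from the text):

#include <pthread.h>

#define N 1000
#define NTHREADS 4

static double x[N + 1], y[N + 1];

struct range { int lo, hi; };            /* iterations [lo, hi) for one thread */

static void *add_chunk(void *arg)
{
    struct range *r = (struct range *)arg;
    for (int i = r->lo; i < r->hi; i++)  /* iterations are independent */
        x[i] = x[i] + y[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = 1 + t * chunk;
        r[t].hi = (t == NTHREADS - 1) ? N + 1 : r[t].lo + chunk;
        pthread_create(&tid[t], NULL, add_chunk, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}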
2. Implementations of ILP
i. Pipelining
ii. Superscalar Architecture
a. Dependency checking on chip
b. Multiple processing elements, e.g., ALU, shifter
iii. VLIW (Very Long Instruction Word) Architecture
a. Simple hardware, complex compiler
iv. Multiprocessor computers

Definition:

Instruction-level parallelism (ILP) is the potential overlap of the
execution of instructions, using the pipelining concept to improve
the performance of the system
or

Instruction-level parallelism (ILP) is a measure of how many of


the operations in a computer program can be performed
simultaneously. The potential overlap among instructions is
called instruction level parallelism.
or
ILP is a measure of the number of instructions that can be
performed during a single clock cycle.
There are two approaches to exploiting ILP.
1. Static Technique – Software Dependent
2. Dynamic Technique – Hardware Dependent

CPI (Cycles per Instruction) for a pipelined processor is the sum


of the base CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls +
Data hazard stalls + Control stalls

o Ideal pipeline CPI: measure of the maximum performance


attainable by the implementation
o Structural hazards: HW cannot support this combination of
instructions
o Data hazards: Instruction depends on result of prior instruction
still in the pipeline
o Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps)
By reducing each of the terms of the right-hand side, we
minimize the overall pipeline CPI and thus increase the IPC
(Instructions per Clock).
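
For example (illustrative numbers only, not taken from the text): with an
ideal pipeline CPI of 1.0 and stall contributions of 0.1 (structural),
0.2 (data hazards) and 0.15 (control), the pipeline CPI = 1.0 + 0.1 + 0.2 +
0.15 = 1.45, i.e., an IPC of 1/1.45 ≈ 0.69. Eliminating the data hazard
stalls alone would reduce the CPI to 1.25 and raise the IPC to 0.8.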

Note: parallelism is possible only when there are no


dependencies between instructions.

Instruction level parallelism is limited by dependences between
instructions. There are three different types of dependences: data
dependences (also called true data dependences), name
dependences, and control dependences.

1. Data Dependence
An instruction j is data dependent on instruction i if either of
the following holds:
a. Instruction i produces a result that may be used by
instruction j, or
b. Instruction j is data dependent on instruction k, and
instruction k is data dependent on instruction i.
The second condition simply states that one instruction is
dependent on another if there exists a chain of dependences of
the first type between the two instructions. This dependence
chain can be as long as the entire program.
E.g.
Loop: L.D    F0, 0(R1)     ; F0 = array element
      ADD.D  F4, F0, F2    ; add scalar in F2
      S.D    F4, 0(R1)     ; store result
      DADDUI R1, R1, #-8   ; decrement pointer 8 bytes
      BNE    R1, R2, LOOP  ; branch if R1 != R2

The data dependences in this code sequence involve both
floating-point data (through F0 and F4) and integer data
(through R1).

The arrows show the data dependences with the previous
instructions.

2. Name Dependences
 The name dependence occurs when two instructions use the
same register or memory location, called a name, but there is
no flow of data between the instructions associated with that
name.
 There are two types of name dependences between an
instruction i that precedes instruction j in program order:
a. An anti dependence between instruction i and
instruction j occurs when instruction j writes a register or
memory location that instruction i reads. The original
ordering must be preserved to ensure that i reads the
correct value.
b. An output dependence occurs when instruction i and
instruction j write the same register or memory location.
The ordering between the instructions must be preserved
to ensure that the value finally written corresponds to
instruction j.
 Since a name dependence is not a true data dependence, the
instructions involved can execute simultaneously or be
reordered, if the name (register or memory location) used in
the instructions is changed so that the instructions do not
conflict. This is called register renaming (it can be done
statically by a compiler or dynamically by the hardware).

Data Hazards
 A data hazard is created whenever there is a dependence
between instructions, and they are close enough that the
overlap caused by pipelining, or a reordering of instructions,
would change the order of access to the operand involved
in the dependence.
 Data hazards may be classified as one of three types,
depending on the order of read and write accesses in the
instructions.
E.g., consider two instructions i and j, with i occurring before j
in program order. The possible data hazards are

RAW (read after write)

 j tries to read a source before i writes it, so j incorrectly
gets the old value.
 This hazard is common and belongs to data dependence
 To overcome this program order must be preserved to
ensure that j receives the value from i.
 E.g.,
instruction i: load r1, a
instruction j: add r2, r1, r1
due to overlapping, if instruction j tries to read r1 before
instruction i completes, then instruction j reads the wrong
(old) value.
WAW (write after write)
 j tries to write an operand before it is written by i.
 This operation leaves the value written by i rather than
the value written by j in the destination.
 This hazard corresponds to output dependence.
 E.g.,
i: mul r1, r2, r3
j: add r1, r4, r5
due to overlapping, if instruction j writes r1 before
instruction i writes it, then r1 will finally hold the
multiplication result instead of the addition result. This can
be eliminated through register renaming, which is done by
the compiler:
i: mul r1, r2, r3
j: add r6, r4, r5

WAR (write after read)


 j tries to write a destination before it is read by i.
 This operation makes i incorrectly get the new value.
 This hazard corresponds to an anti dependence.
 E.g.,
i: mul r1, r2, r3
j: add r2, r4, r5
due to overlapping, if instruction j writes r2 before
instruction i reads it, then i reads the new value and the
final output will be incorrect. This can be eliminated through
register renaming, which is done by the compiler:
i: mul r1, r2, r3
j: add r6, r4, r5

3. Control dependence

 A control dependence determines the ordering of an
instruction, i, with respect to a branch instruction so that the
instruction i is executed in correct program order.
 Every instruction is control dependent on some set of
branches
 E.g., the dependence of the statements in the “then” part
of an if statement on the branch.
if p1
{ S1;
};
if p2
{
S2;
}
 S1 is control dependent on p1, and S2 is control dependent on p2
but not on p1.
 In general, there are two constraints imposed by control
dependences:
1. An instruction that is control dependent on a branch
cannot be moved before the branch so that its execution
is no longer controlled by the branch.
o For example, we cannot take an instruction from the
then-portion of an if-statement and move it before the
if-statement.
2. An instruction that is not control dependent on a branch
cannot be moved after the branch so that its execution is
controlled by the branch.
o For example, we cannot take a statement before the
if-statement and move it into the then-portion.
 Control dependence is preserved by two properties in a
simple pipeline
 First, instructions execute in program order. This ordering
ensures that an instruction that occurs before a branch is
executed before the branch.
 Second, the detection of control or branch hazards
ensures that an instruction that is control dependent on a
branch is not executed until the branch direction is known.

Parallel computing platforms / Flynn's classification
(Taxonomy) / Parallel computer models

 Computer architecture (i.e., computers) is classified based on
two dimensions
1. Instruction stream
The number of instruction streams that the computer can
process at a single point in time.
2. Data streams
The number of data streams that can be processed at a
single point in time.
 Graphical representation (instruction streams vs. data streams):

                           Single Data        Multiple Data
 Single Instruction        SISD               SIMD
 Multiple Instruction      MISD               MIMD

1. Single Instruction Stream , Single Data Stream


 Traditional sequential computer that provides no parallelism.
 Also called a uniprocessor.
 Instructions are executed in a serial fashion.
 Only one data stream is processed by the CPU during a given
clock cycle.
 E.g. IBM PC, older mainframe computers
 Pictorial Representations

2. Single Instruction and Multiple data Streams


A single instruction is executed by multiple processors using
different data streams.
Each processor has its own data memory (i.e., multiple
different data streams).
There is only one instruction memory and one control
processor (which fetches and dispatches instructions).
These machines are used in applications such as digital
signal processing, image processing and multimedia
applications (audio and video).
Pictorial representation

3. Multiple Instruction Stream , Single Data Stream
 These machines are capable of processing a single data
stream using multiple instruction streams simultaneously.
 Generally multiple instruction streams need multiple data
streams, so this class of parallel computer is mainly used as
a theoretical model.
 Pictorial representation

4. Multiple Instruction Streams, Multiple Data Streams


 Each processor fetches its own instructions and operates on
its own data.
 The processors are often off-the-shelf microprocessors.
 These machines are the most common parallel computing
platform today.
 E.g., multicore systems such as the Intel Core Duo processor
 Pictorial Representation

MIMD multiprocessors fall into two categories

1. Centralized Shared Memory Architecture (or) Symmetric
Shared Memory Multiprocessor (SSM)
2. Distributed Shared Memory Architecture (DSM)

SYMMETRIC SHARED MEMORY MULTIPROCESSORS

A single main memory is accessed by all processors, with a
uniform access time from any processor; this style of
architecture is also called uniform memory access (UMA).

Multiple processor cache subsystems share the same


physical memory, connected by a bus

Shared-memory Architectures usually support the caching of
both shared and private data.

Private data:

 Private data is used by a single processor.


 When a private item is cached, its location is migrated to
the cache, reducing the average access time as well as
the memory bandwidth required.
Shared data:

 Shared data is used by multiple processors, essentially


providing communication among the processors through
reads and writes of the shared data.
 When shared data are cached, the shared value may be
replicated in multiple caches.
 Caching of shared data, however, introduces a new
problem: cache coherence.

Coherence: Coherence defines the behavior of reads and
writes to the same memory location.

 If P writes to X then reads X, with no writes to X by other


processors, then the value read should be that written by
P.
 If P writes to X and then another processor reads from X, and
the read and write are sufficiently separated in time, then the
value read should be that written by P.
 Write serialization: Writes to the same location are
serialized, that is, two writes to the same location by any

two processors are seen in the same order by all
processors.
Eg: if the values 1 and then 2 are written to a
location, processors can never read the value of the
location as 2 and then later read it as 1. This property
is called write serialization.
Consistency: Consistency defines the behavior of reads and
writes with respect to accesses to other memory locations. It
determines when a written value will be returned by a read.
In a coherent multiprocessor, the caches provide both
migration and replication of shared data items.

Migration:

Coherent caches provide migration, since a data item can


be moved to a local cache and used there in a transparent
fashion. This migration reduces both the latency to access a
shared data item that is allocated remotely and the bandwidth
demand on the shared memory.

Replication:

Coherent caches also provide replication for shared data


that is being simultaneously read, since the caches make a
copy of the data item in the local cache.

Cache Coherence Problem:

This problem occurs when different processors can have


different values for the same location
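
The example figure is not reproduced here. A typical sequence of this kind
(assuming write-through caches and memory location X initially holding the
value 1) would be:

Time   Event                    Cache contents   Cache contents   Memory contents
                                for CPU A        for CPU B        for location X
0                                                                 1
1      CPU A reads X            1                                 1
2      CPU B reads X            1                1                1
3      CPU A stores 0 into X    0                1                0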

In the above example, at time 3, CPU A and CPU B hold
different values for the same data item in their caches. This is
called the cache coherence problem. This problem can be
avoided by cache coherence protocols.

Cache Coherence Protocols:

The protocols to maintain coherence for multiple


processors are called cache coherence protocols.

There are two classes of protocols, which use different


techniques to track the sharing status:

Directory based—The sharing status of a block of physical


memory is kept in just one location, called the directory.

Snooping—Every cache that has a copy of the data from a


block of physical memory also has a copy of the sharing status
of the block, and no centralized state is kept.

 The caches are usually on a shared-memory bus.


 All cache controllers monitor or snoop on the bus to
determine whether or not they have a copy of a block that
is requested on the bus.
Protocols for Coherency (Snoopy Protocols):

 Write invalidate Protocol
 When one processor writes, invalidates all copies of
this data that may be in other caches
 Write Update Protocol
 When one processor writes, broadcast the value and
update any copies that may be in other caches

Write invalidate Protocol

Write invalidate protocol is a method to ensure that a
processor has exclusive access to a data item before it writes
that item. It invalidates other copies on a write.

 Write to shared data: an invalidate is sent to all caches,
which snoop the bus and invalidate any copies they hold.
 The updated copy is available through
o Write-through: memory is always up-to-date
o Write-back: snoop in caches to find most recent
copy
Example
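
The example figure is not reproduced here. A typical sequence for a
write-invalidate, write-back snooping protocol (assuming both caches start
empty and memory location X initially holds 0) would be:

Processor activity    Bus activity          Contents of      Contents of      Contents of
                                            CPU A's cache    CPU B's cache    memory
                                                                              location X
                                                                              0
CPU A reads X         Cache miss for X      0                                 0
CPU B reads X         Cache miss for X      0                0                0
CPU A writes 1 to X   Invalidation for X    1                                 0
CPU B reads X         Cache miss for X      1                1                1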

Write Update Protocol

This updates all the cached copies of a data item when that
item is written. If a data is not shared, then there is no need to
broadcast or update any other caches.

Processor activity    Bus activity            Contents of      Contents of      Contents of
                                              CPU A's cache    CPU B's cache    memory
                                                                                location X
                                                                                0
CPU A reads X         Cache miss for X        0                                 0
CPU B reads X         Cache miss for X        0                0                0
CPU A writes 1 to X   Write broadcast of X    1                1                1
CPU B reads X                                 1                1                1

Performance differences between write update and


write invalidate protocols:

1. Multiple writes to the same word with no intervening reads


require multiple write broadcasts in an update protocol, but
only one initial invalidation in a write invalidate protocol.

2. With multiword cache blocks, each word written in a cache
block requires a write broadcast in an update protocol,
although only the first write to any word in the block needs to
generate an invalidate in an invalidation protocol. An
invalidation protocol works on cache blocks, while an update
protocol must work on individual words.

3. The delay between writing a word in one processor and


reading the written value in another processor is usually less in
a write update scheme, since the written data are immediately
updated.

Write Invalidation Implementation Techniques:

1) use of the bus to perform invalidates.


 To perform an invalidate the processor simply acquires
bus access and broadcasts the address to be invalidated
on the bus.
 All processors continuously snoop on the bus watching the
addresses.
 The processors check whether the address on the bus is in
their cache. If so, the corresponding data in the cache is
invalidated.
2) When two processors compete to write to the same location,
one must obtain bus access before the other. The first
processor to obtain bus access will cause the other
processor’s copy to be invalidated, causing writes to be
strictly serialized.
3) Updation can be done by
Write Through cache

In a write-through cache, it is easy to find the recent
value of a data item, since all written data are always sent to
the memory, from which the most recent value of a data
item can always be fetched.

Write Back cache:

Each processor snoops every address placed on the
bus. If a processor finds that it has a dirty copy (a modified
copy not yet written back to memory) of the requested cache
block, it provides that cache block in response to the read
request and causes the memory access to be aborted.

4) To track whether or not a cache block is shared, we can add
an extra state bit, indicating whether the block is shared,
associated with each cache block. The processor with the
sole copy of a cache block is normally called the owner of
the cache block.

5) When invalidation is sent, the state of the owner’s cache


block is changed from shared to unshared (or exclusive).
6) If another processor later requests this cache block, the
state must be made shared again.
Coherence Protocol is implemented in each node

Merged State Transition Diagram
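
The merged diagram combines the CPU-request and bus-request transitions for
each cache block. The diagram itself is not reproduced here; as a rough
sketch only (a simplified MSI-style write-invalidate protocol with made-up
names, not the exact diagram from the text), the states and the main
transitions can be written as:

/* Simplified sketch of a snooping write-invalidate cache block state
   machine. Illustrative only; real protocols add more states/actions. */

enum block_state { INVALID, SHARED, MODIFIED };

enum event { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_OR_INVALIDATE };

static enum block_state next_state(enum block_state s, enum event e)
{
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;    /* read miss: fetch block            */
        if (e == CPU_WRITE) return MODIFIED;  /* write miss: fetch + invalidate     */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE) return MODIFIED;  /* send invalidate on the bus         */
        if (e == BUS_WRITE_OR_INVALIDATE) return INVALID; /* another CPU writes     */
        return SHARED;                        /* local or remote reads keep sharing */
    case MODIFIED:
        if (e == BUS_READ_MISS)  return SHARED;  /* supply dirty block, write back  */
        if (e == BUS_WRITE_OR_INVALIDATE) return INVALID;
        return MODIFIED;                         /* local reads and writes hit      */
    }
    return INVALID;
}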

DISTRIBUTED SHARED MEMORY ARCHITECTURES

 Physical memory is distributed to all the processors.


 A directory keeps the state of every block that may be
cached. Information in the directory includes which caches
have copies of the block, whether it is dirty, and so on.
 For larger multiprocessors, the directory structure must be
scaled accordingly.

 A directory is added to each node to implement cache
coherence in a distributed memory multiprocessor. Each
directory is responsible for tracking the caches that share
the memory address of the portion of memory in the node.

Directory-Based Cache-Coherence Protocols:

States:

Shared—One or more processors have the block cached, and


the value in memory is up to date (as well as in all the caches).

Uncached—No processor has a copy of the cache block.

Exclusive—Exactly one processor has a copy of the cache


block and it has written the block, so the memory copy is out of
date. The processor is called the owner of the block.

 In addition to tracking the state of each cache block, we


must track the processors that have copies of the block
when it is shared, since they will need to be invalidated on
a write.
 The simplest way to do this is to keep a bit vector for each
memory block.
 When the block is shared, each bit of the vector indicates
whether the corresponding processor has a copy of that
block.
 Bit vector is also used to keep track of the owner of the
block when the block is in the exclusive state.
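
As a rough sketch of such a directory entry (illustrative C only; the field
names, the 64-bit sharer vector and the helper function are assumptions, not
from the text):

#include <stdint.h>

/* One directory entry per memory block in the home node. */
enum dir_state { UNCACHED, SHARED_STATE, EXCLUSIVE };

struct dir_entry {
    enum dir_state state;
    uint64_t sharers;   /* bit vector: bit p set => processor p has a copy;
                           in the exclusive state it identifies the owner  */
};

/* Mark processor p as a sharer of the block. */
static void add_sharer(struct dir_entry *e, int p)
{
    e->sharers |= (1ULL << p);
}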

Nodes in DSM:
 The local node is the node where a request originates.
 The home node is the node where the memory location
and the directory entry of an address reside.
 A remote node is the node that has a copy of a cache
block, whether exclusive (in which case it is the only copy)
or shared. A remote node may be the same as either the
local node or the home node
An Example Directory Protocol

The state transitions for an individual cache are caused by read


misses, write misses, invalidates, and data fetch requests.

Figure shows the actions taken at the directory in response to
messages received.

The directory receives three different requests: read miss, write


miss, and data write back. The messages sent in response by
the directory are shown in bold, while the updating of the set
Sharers is shown in bold italics.

Uncached State:

When the block is in the uncached state, the directory can
receive only two kinds of messages.

Read miss—The requesting processor is sent the


requested data from memory and the requestor is made
the only sharing node. The state of the block is made
shared.

Write miss—The requesting processor is sent the value
and becomes the sharing node. The block is made
exclusive to indicate that the only valid copy is cached.
Sharers indicates the identity of the owner.

Shared State:

When the block is in the shared state, the memory is up to
date, so the same two requests can occur:

Read miss—The requesting processor is sent the


requested data from memory and the requesting
processor is added to the sharing set.

Write miss—The requesting processor is sent the value.


All processors in the set Sharers are sent invalidate
messages, and the Sharers set is made to contain only the
identity of the requesting processor. The state of the block
is made exclusive.

Exclusive State:

When the block is in the exclusive state the current value of the
block is held in the cache of the processor identified by the set
sharers (the owner), so there are three possible directory
requests

Read miss—The owner processor is sent a data fetch


message, which causes the state of the block in the
owner’s cache to transition to shared and causes the
owner to send the data to the directory, where it is written
to memory and sent back to the requesting processor. The
identity of the requesting processor is added to the set
sharers, which still contains the identity of the processor
that was the owner.
Data write-back—The owner processor is replacing the
block and therefore must write it back. This write-back
makes the memory copy up to date (the home directory
essentially becomes the owner), the block is now
uncached, and the sharer set is empty.

Write miss—The block has a new owner. A message is


sent to the old owner causing the cache to invalidate the
block and send the value to the directory, from which it is
sent to the requesting processor, which becomes the new
owner. Sharers is set to the identity of the new owner, and
the state of the block remains exclusive.

PERFORMANCE ISSUES

PERFORMANCE OF SYMMETRIC SHARED MEMORY (SSM)

 Cache performance is a combination of the behaviour of


o Uniprocessor cache miss traffic
o Traffic caused by communication-invalidation and
subsequent cache misses
 Changing the processor count, cache size and block size
can affect these two components of miss rate
Coherency Misses

The misses that arise from interprocessor communication are
called coherence misses. They can be broken into two separate
sources:

1. True sharing misses


2. False sharing misses
True sharing miss
True sharing misses arise from the communication of data
through the cache coherence mechanism

 Invalidates due to first write to shared block


 Reads by another CPU of modified block in different cache
 Miss would still occur if block size were one word
False sharing miss

False sharing misses occur when a block is invalidated because
some word in the block, other than the one being read, is
written into.

 Invalidation does not cause a new value to be


communicated, but only causes an extra cache miss
 Block is shared, but no word in block is actually
shared(Miss would not occur if block size were 1 word)
Example

Assume that words x1 and x2 are in the same cache block,


which is in the shared state in the caches of P1 and P2.
Assuming the following sequence of events, identify each miss
as a true sharing miss or a false sharing miss.
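
The sequence of events referred to below is not reproduced in these notes;
it is assumed to be the sequence used in Hennessy and Patterson:

Time   P1            P2
1      Write x1
2                    Read x2
3      Write x1
4                    Write x2
5      Read x2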

Result

1: True sharing miss
Since x1 was read by P2, x1 needs to be invalidated from P2.

2: False sharing miss
x2 was invalidated by the write of x1 in P1, but that value
of x1 is not used in P2.

3: False sharing miss
The block containing x1 is marked shared due to the read in
P2, but P2 did not read x1. A write miss is required to
obtain exclusive access to the block.

4: False sharing miss
For the same reason as in step 3.

5: True sharing miss

 Since the value being read was written by P2
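
The false sharing effect described above can also be provoked in software
when two threads repeatedly write different words that lie in the same
cache block. A minimal sketch in C with POSIX threads (illustrative only;
the structure, iteration count and the 64-byte block size are assumptions):

#include <pthread.h>

#define ITERS 10000000L

/* x1 and x2 are adjacent, so they almost certainly lie in the same cache
   block: every write by one thread invalidates the block in the other
   thread's cache even though no data actually flows between the threads.
   Padding the structure so that x1 and x2 fall in different (typically
   64-byte) blocks removes the false sharing.                            */
struct counters { long x1; long x2; };
static struct counters shared_block;

static void *writer1(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared_block.x1++;
    return NULL;
}

static void *writer2(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared_block.x2++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer1, NULL);
    pthread_create(&t2, NULL, writer2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

On a multicore machine, padding the structure (e.g., long x1; char pad[56];
long x2;) typically makes the two writers run noticeably faster, because
each counter's block then stays in its own writer's cache.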


Performance Measurements

The following are the different performance measurements


of symmetric shared memory multiprocessors

 Commercial Workload
 Multiprogramming and OS Workload
Performance Measurements of the Commercial
workload

The following models are considered for the performance
measurements of the commercial workload:

 The performance of 3 types of applications, OLTP, DSS
and AltaVista, is compared.
 OLTP applications perform Online Transaction Processing.
 DSS (Decision Support System)
 AV (AltaVista – an application which performs Web
search)

 The performance of the DSS and AltaVista workloads is
reasonable.
 The performance of the OLTP workload is very poor (due to
the poor performance of the memory hierarchy).
To improve OLTP Performance the following factors are
considered

 Increase in Cache Size


 Increase in Block Size
 Increase in Processor Count

Cache Miss Types
 True sharing
 False sharing
 Capacity – occurs when the cache cannot contain all the
blocks needed during execution
 Conflict – occurs when too many blocks map to the same set
 Cold (or) Compulsory miss – the first-time access to a block
Increase in Cache Size

 True sharing and false sharing misses are essentially
unchanged going from a 1 MB to an 8 MB (L3) cache.
 Uniprocessor cache misses (instruction, capacity/conflict,
compulsory) improve as the cache size increases.
 The cold, false sharing and true sharing misses are unaffected
by an increase in cache size.

Increase in Processor Count

The contribution to memory access cycles increases as


processor count increase primarily due to increased true
sharing.

Increase in Block Size

Increasing the block size reduces capacity misses and
compulsory misses; the remaining miss types are unaffected.

Performance Measurements of the Multiprogramming
and OS Workload

For the workload measurements, we assume the following


memory and I/O systems

Level 1 instruction cache

Level 1 data cache

Level 2 cache

Main memory

Disk system

                            User        Kernel      Synchronization   CPU idle
                            execution   execution   wait              (waiting for I/O)
% instructions executed     27          3           1                 69
% execution time            27          7           2                 64

Execution time is broken into four components:

1. Idle – Execution in the kernel mode idle loop

2. User – Execution in user code

3. Synchronization – Execution or waiting for synchronization
variables

4. Kernel – Execution in the OS that is neither idle nor in
synchronization access

Performance:

Even though user code executes more instructions, the miss rate
for the kernel is higher, because of the following reasons:

i) The kernel initializes all pages before allocating them to
a user.
ii) The kernel shares data.

Increasing the data cache from 32 KB to 256 KB causes the user
miss rate to decrease proportionately more than the kernel miss
rate.

MULTITHREADING or THREAD LEVEL PARALLELISM

 Multithreading allows multiple threads to share the


functional units of a single processor in an overlapping
fashion.
 To permit this sharing, the processor must duplicate the
independent state of each thread.
 For example, a separate copy of the register file, a
separate PC, and a separate page table are required for
each thread.
 The memory can be shared through the virtual memory
mechanisms.
 Hardware must support the ability to change to a different
thread relatively quickly (i.e., a thread switch).
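
As a rough illustration only (the field widths and register count are
assumptions, not from the text), the per-thread state listed above can be
pictured as a small structure that the hardware replicates once per thread:

#include <stdint.h>

/* Illustrative per-hardware-thread state. */
struct hw_thread_context {
    uint64_t pc;               /* separate program counter             */
    uint64_t regs[32];         /* separate architectural register file */
    uint64_t page_table_base;  /* separate page table pointer          */
};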

There are two main approaches to multithreading.

 Fine-grained multithreading
 Coarse-grained multithreading

Fine-grained multithreading

 It switches between threads on each instruction, causing
the execution of multiple threads to be interleaved.
 This interleaving is often done in a round-robin fashion,
skipping any threads that are stalled at that time.
 In fine-grained multithreading, the CPU must be able to
switch threads on every clock cycle.

Advantage:

 It can hide the throughput losses that arise from both


short and long stalls, since instructions from other threads
can be executed when one thread stalls.

Disadvantage:

 It slows down the execution of the individual threads,


since a thread that is ready to execute without stalls will
be delayed by instructions from other threads.

Coarse-grained multithreading

 Coarse-grained multithreading was invented as an


alternative to fine-grained multithreading.
 Coarse-grained multithreading switches threads only on
costly stalls, such as level-two (L2) cache misses.

Disadvantage:

 It is limited in its ability to overcome throughput losses,
especially from shorter stalls.
 There will be the pipeline start-up costs in coarse-grain
multithreading.
 A CPU with coarse-grained multithreading issues
instructions from a single thread; when a stall occurs, the
pipeline must be emptied or frozen. The new thread that
begins executing after the stall must fill the pipeline
before instructions will be able to complete.
 Because of this start-up overhead, coarse-grained
multithreading is much more useful for reducing the
penalty of high cost stalls, where pipeline refill is
negligible compared to the stall time.

Simultaneous Multithreading:

 Converting Thread-Level Parallelism into Instruction-


Level Parallelism
 It exploits TLP at the same time it exploits ILP
 Since multiple functional units are available and
multiple issue is possible TLP and ILP can be
implemented.
 Based upon dynamically scheduled processor
 With register renaming and dynamic scheduling,
multiple instructions from independent threads can
be issued without regard to the dependences among
them.

The figure shows the following
1. a superscalar without multithreading
-limited by a lack of ILP
2. a superscalar with coarse-grained multithreading
- the long stalls are partially hidden by switching to
another thread.
3. a superscalar with fine-grained multithreading
- only one thread issues instructions in a given clock
cycle
4. a superscalar with simultaneous
multithreading(SMT).
-In SMT case, thread-level parallelism (TLP) and
instruction-level parallelism (ILP) are exploited
simultaneously; with multiple threads using the issue slots
in a single clock cycle.

Factors that limit the issue slot:

 how many active threads are considered


 finite limitations on buffers
 ability to fetch enough instructions from multiple threads,

Design Challenges in SMT processors

There are a variety of design challenges for an SMT processor,


including:

 dealing with a larger register file needed to hold multiple


contexts
 maintaining low overhead on the clock cycle, particularly
in critical steps such as instruction issue, where more
candidate instructions need to be considered, and in
instruction completion, where choosing what instructions
to commit may be challenging
 ensuring that the cache conflicts generated by the
simultaneous execution of multiple threads do not cause
significant performance degradation.

SOFTWARE MULTITHREADING

Multithreading without hardware support for storing the state (PC,
registers, etc.) of multiple threads simultaneously; typically only
one thread runs per CPU at a time, resulting in reduced utilization.

There are two levels of thread

1. User level(for user thread)

2. Kernel level(for kernel thread)

KERNEL LEVEL THREADS:

 Kernel threads are supported directly by the operating


system.
 The kernel performs thread creation, scheduling, and
management in kernel space.

 Because thread management is done by the operating
system, kernel threads are generally slower to create and
manage than are user threads.
 Most operating systems-including Windows NT, Windows
2000, Solaris 2, BeOS, and Tru64 UNIX (formerly Digital
UN1X)-support kernel threads.

MULTI-THREADING MODELS:

There are three models for thread libraries, each with its own
trade-offs

1. Many threads on one LWP (many-to-one)

2. One thread per LWP (one-to-one)

3. Many threads on many LWPs (many-to-many)

USER LEVEL THREADS:

 User threads are supported above the kernel and are


implemented by a thread library at the user level.
 The library provides support for thread creation,
scheduling, and management with no support from the
kernel.
 Because the kernel is unaware of user-level threads, all
thread creation and scheduling are done in user space
without the need for kernel intervention.
 User-level threads are generally fast to create and
manage. User-thread libraries include POSIX Pthreads,
Mach C-threads, and Solaris 2 UI-threads.
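
As a small illustration of the Pthreads interface mentioned above (a sketch
only; the work done by each thread is made up for the example):

#include <pthread.h>
#include <stdio.h>

/* Each thread prints its identifier; pthread_create and pthread_join are
   the basic creation and management calls provided by the library.     */
static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}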

MANY-TO-ONE:

The many-to-one model maps many user-level threads to one


kernel thread.

Advantages:

1. Totally portable

2. More efficient

Disadvantages:

1. It cannot take advantage of parallelism

2. The entire process is blocked if a thread makes a blocking
system call

3. Mainly used in language systems and portable thread
libraries, such as the Solaris Green threads library

ONE-TO-ONE:

The one-to-one model maps each user thread to a kernel thread.

Advantages:

It allows parallelism and provides more concurrency.

Disadvantages:

Each user thread requires a corresponding kernel thread, limiting
the total number of threads. Used in LinuxThreads and other
systems such as Windows NT and Windows 2000.

MANY-TO-MANY:

The many-to-many model multiplexes many user-level threads


to a smaller or equal number of kernel threads.

Advantages:

1. Can create as many user thread as necessary

2. Allows parallelism

Disadvantages: Creating many kernel threads can burden the performance

HARDWARE MULTITHREADING

[Refer to the fine-grained multithreading, coarse-grained multithreading
and SMT sections discussed above.]

MULTI-CORE ARCHITECTURES (Or)
CHIP MULTIPROCESSING (CMP) ARCHITECTURES

Multiprocessors
A multiprocessor system has its processors residing in separate
chips and processors are interconnected by a backplane bus.
Multi-core processors
 Chip-level multiprocessing (CMP or multicore): integrates
two or more independent cores (normally CPUs) into a
single package composed of a single integrated circuit (IC),
called a die.
 Processors may share an on-chip cache,
or each can have its own cache.
 Examples: HP Mako, IBM Power4
Pictorial representation

Single core processor

Multi-core processor

Different Multicore Classes


1. Homogenous Multicore
 Replication of the same processor type on the die (the chip),
forming a shared memory multiprocessor.
 Examples: AMD and Intel dual- and quad-core processors
2. Heterogeneous/Hybrid Multicore
 Different processor types on a die

 Example: IBM Cell Broadband Engine
a heterogeneous model could have a large centralized
core built for generic processing and running an OS, a core
for graphics, a communications core, an enhanced
mathematics core, an audio core, a cryptographic core,
and so on
 advantage
each core can be tailored so that it handles a different
kind of task
 disadvantage
 it is difficult to manufacture this type of IC
 special expertise is required to program and operate
these cores

Chip Multithreading
Chip Multithreading = Chip Multiprocessing + Hardware
Multithreading
 CMT is achieved by having multiple cores on a single chip
and/or multiple threads on a single core.
 CMT processors are especially suited to server workloads,
which generally have high levels of Thread-Level
Parallelism (TLP).
Design Issues / Challenges faced by CMP
Challenges are
1. Power and temperature management
2. Memory/Cache coherence
3. Multithreading
1. Power and Temperature Management
 If two cores are placed on a single chip, the chip consumes
twice the power and generates a large amount of heat.
 If a processor overheats, the computer may even combust.
 To combat unnecessary power consumption, designers
incorporate a power control unit that can shut down unused
cores or limit the amount of power they use. By powering
off unused cores, the amount of leakage in the chip is
reduced.
 To lessen the heat generated by multiple cores, the number
of hot spots is limited and the heat is spread out across the
chip; a thermal monitoring unit is included for this purpose.
2. Memory/Cache Coherence
Refer to the cache coherence problem and snooping protocols
discussed earlier.
3. Multithreading
Using a multicore processor to its full potential is achieved by
rebuilding applications so that they are able to run threads on
different cores.

Case Study : Intel Multicore Processor


The features of Intel multicore processors
1. Intel Turbo Boost Technology
It regulates the power supply to all the cores and shuts off
the power supply to unused cores.
The remaining active cores are then allowed to operate at a
higher frequency when power and thermal headroom is
available.
2. Intel Hyperthreading
 uses processor resources more efficiently, enabling
multiple threads to run on each core.
 it also increases processor throughput, improving
overall performance on threaded software.
 Used in Intel Xeon processor family.
3. Intel advanced smart cache
is a multi-core optimized cache that improves
performance and efficiency by increasing the probability
that each execution core of a dual-core processor can
access data from a higher-performance, more-efficient
cache subsystem. To accomplish this, Intel shares L2
cache between cores.

4. Intel advanced memory access


improves system performance by optimizing the use of the
available data bandwidth from the memory subsystem
and hiding the latency of memory accesses.
The goal is to ensure that data can be used as quickly as
possible and that this data is located as close as possible
to where it’s needed to minimize latency and thus improve
efficiency and speed.

5. Intel microarchitecture

 Uses an advanced nanometer-scale process technology.
 Designed to deliver increased performance combined
with superior power efficiency.
 The chips are organized in such a way that they are
energy efficient.
6. Intel QuickPath Interconnect

Is a platform architecture that provides high-speed
connections between microprocessors and external
memory, and between microprocessors and the I/O hub
Its advantage is that it is point-to-point: there is no single bus
that all the processors must use and contend with each
other to reach memory and I/O. This improves scalability
and eliminates the competition between processors for
bus bandwidth.
7. Intel virtualization technology
Intel Virtualization Technology (Intel VT) is a set of
hardware enhancements to Intel server and client
platforms that provide software-based virtualization
solutions. Intel VT allows a platform to run multiple
operating systems and applications in independent
partitions, allowing one computer system to function as
multiple virtual systems.
8. Fully buffered DIMM (dual in-line memory module)
is a memory technology that can be used to increase
reliability and density of memory systems.
In conventional designs, the data lines from the memory
controller have to be connected to the data lines in every
DRAM module, i.e., via multidrop buses. As memory width,
as well as access speed, increases, the signal degrades at
the interface of the bus and the device. This limits the
speed and/or the memory density. FB-DIMMs take a
different approach to solve this problem.
