Parallel Processing: sp2016 Lec#5
Dr M Shamim Baig
Parallel Platforms:
Memory (Physical vs Logical) Configurations
Physical vs Logical Memory Config
Physical Memory config (SM, DM, CSM)
Logical Address Space config (SAS, NSAS)
Combinations
CSM + SAS (SMP; UMA)
DM + SAS (DSM; NUMA)
DM + NSAS (Multicomputer/Clusters)
UMA vs NUMA
SM-multiprocessors are further categorized, based on memory access delay, as UMA (uniform memory access) & NUMA (non-uniform memory access).
A UMA system is based on the (CSM + SAS) config, where each processor has the same delay for accessing any memory location.
A NUMA system is based on the (DM + SAS = DSM) config, where a processor may have different delays for accessing different memory locations.
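As a hedged illustration (the numbers here are assumed, not from the lecture): if a local access costs 4 clock cycles and a remote access over the interconnect costs 40 cycles, then with a program locality factor f the mean access time is f × 4 + (1 − f) × 40 cycles; for f = 0.8 this gives 0.8 × 4 + 0.2 × 40 = 11.2 cycles.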
[Figure: bus-based shared-memory multiprocessor, with processors and shared memory attached to a common bus. Examples: dual-Pentium & quad-Pentium systems.]
[Figure: quad-processor bus-based SMP. Each processor has its own L1 & L2 cache and a bus interface onto the shared processor/memory bus; the memory controller (with the shared memory) and the I/O interface (to the I/O bus) also attach to this bus.]
[Figure: message-passing multicomputer, where each computer (node) has its own local memory and nodes communicate over an interconnection network.]
Data Exchange / Synch Platforms:
Shared-memory vs Message-Passing
Shared-memory platforms have low communication overhead and can support finer grain levels, while message-passing platforms have higher communication overhead & are therefore better suited to coarse grain levels.
SM multiprocessors are faster but have poor scalability.
Message-passing multicomputer platforms are slower but have higher scalability.
Beowulf Clusters*
A group of interconnected commodity computers achieving high performance at low cost, typically using commodity interconnects (e.g. high-speed Ethernet) & a commodity OS (e.g. Linux).
* The name Beowulf comes from the NASA Goddard Space Flight Center cluster project.
Interconnection Network:
o Interface level: memory bus (using MBEU) in SM-multiprocessors (UMA, NUMA) vs I/O bus (using NIU) in multicomputers / clusters
o Data exchange / synch: Shared-Data model vs Message-Passing model
Homework:
Self-assessed problems
Please mark your solution & note the marks you achieved.
Problems:
Explicit Parallel Architectures
Example Problem 1:
Bus-based SM-Multiprocessor:
Limit of Parallelism
Consider an SM-multiprocessor using 32-bit RISC processors running at 150 MHz, each carrying out one instruction per clock cycle. Assume 15% data-load & 10% data-store instructions using a shared bus having 2 GB/sec bandwidth.
Compute the maximum number of processors that can be connected to the above bus for the following parallel configurations:
Example Problem 1 (contd):
Bus-based SM-Multiprocessor:
Limit of Parallelism
(a) SMP (without cache memory)
(b) SMP with cache memory having a hit ratio of 95% & a memory write-through policy
(c) NUMA with program locality factor = 80%
(A hedged worked sketch follows this list.)
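A possible worked sketch in Python (an addition for self-checking, not the lecture's official solution; it assumes that only data loads & stores generate shared-bus traffic, that each access transfers one 4-byte word, and that 1 GB = 10^9 bytes):

# Hedged sketch for Example Problem 1 (assumptions noted above).
CLOCK_HZ   = 150e6    # 150 MHz, one instruction per cycle
WORD_BYTES = 4        # 32-bit data word
LOAD_FRAC  = 0.15
STORE_FRAC = 0.10
BUS_BW     = 2e9      # 2 GB/sec shared-bus bandwidth

def max_processors(bus_bytes_per_sec_per_cpu):
    # The bus saturates when total traffic reaches BUS_BW.
    return int(BUS_BW // bus_bytes_per_sec_per_cpu)

# (a) SMP, no cache: every load & store goes over the bus.
traffic_a = (LOAD_FRAC + STORE_FRAC) * CLOCK_HZ * WORD_BYTES   # 150 MB/s
print("(a)", max_processors(traffic_a))                        # -> 13

# (b) SMP with cache, 95% hit ratio, write-through:
#     only load misses use the bus, but every store does (write-through).
hit = 0.95
traffic_b = (LOAD_FRAC * (1 - hit) + STORE_FRAC) * CLOCK_HZ * WORD_BYTES
print("(b)", max_processors(traffic_b))                        # -> 31

# (c) NUMA with 80% locality: only the 20% remote accesses use the bus.
locality = 0.80
traffic_c = (LOAD_FRAC + STORE_FRAC) * (1 - locality) * CLOCK_HZ * WORD_BYTES
print("(c)", max_processors(traffic_c))                        # -> 66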
Figure: bus-based interconnects (a) with no local caches; (b) with local memory/caches.
Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines. Example? (One is sketched below.)
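As a hedged illustrative example (numbers assumed, consistent with the Problem 1 sketch above): with a 95% cache hit ratio and a write-through policy, only load misses and stores reach the shared bus, so per-processor bus traffic drops from about 150 MB/s to about 64.5 MB/s, raising the bus-saturation limit from roughly 13 processors to roughly 31.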
Homework:
Self-assessed problem
Please mark your solution & note the marks you achieved.
Example Problem 2:
Message-Passing Multicomputer:
Local vs Remote memory data access delays
Consider a 64-node multicomputer, where each node comprises a 32-bit RISC processor with a 250 MHz clock rate & 8 MB of local memory. A local memory access requires 4 clock cycles, the remote communication initiation (setup) overhead is 15 clock cycles, & the interconnection network bandwidth is 80 MB/sec. The total number of instructions executed is 200,000.
If memory data loads & stores are 15% & 10% of the instructions respectively, compute:
(a) Load/store time if all accesses are to local nodes
(b) Load/store time if 20% of accesses are to remote nodes
Note: assume packet lengths are variable (they depend on the address & data bytes) & the communication protocol is as given ???. The size of the packet fields is a multiple of bytes.
(A hedged worked sketch follows below.)
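A possible worked sketch in Python (an addition for self-checking, not the lecture's official solution; since the communication protocol/packet format is not reproduced here, the remote request packet is assumed to carry a 4-byte address plus a 4-byte data word, the remote node's own 4-cycle memory access is counted, and 1 MB = 10^6 bytes):

# Hedged sketch for Example Problem 2 (assumptions noted above).
CLOCK_HZ     = 250e6        # 250 MHz -> 4 ns per clock cycle
INSTRUCTIONS = 200_000
ACCESS_FRAC  = 0.15 + 0.10  # loads + stores
LOCAL_CYCLES = 4            # local memory access
SETUP_CYCLES = 15           # remote communication setup overhead
NET_BW       = 80e6         # 80 MB/sec interconnection network
PACKET_BYTES = 4 + 4        # assumed: 4-byte address + 4-byte data word

accesses = INSTRUCTIONS * ACCESS_FRAC           # 50,000 loads/stores

# (a) all accesses are local
cycles_a = accesses * LOCAL_CYCLES
print("(a) %.2f ms" % (cycles_a / CLOCK_HZ * 1e3))   # -> 0.80 ms

# (b) 20% of accesses are remote:
#     setup + packet transfer over the network + remote memory access
remote_frac     = 0.20
transfer_cycles = PACKET_BYTES / NET_BW * CLOCK_HZ   # 100 ns -> 25 cycles
remote_cycles   = SETUP_CYCLES + transfer_cycles + LOCAL_CYCLES
cycles_b = (accesses * (1 - remote_frac) * LOCAL_CYCLES
            + accesses * remote_frac * remote_cycles)
print("(b) %.2f ms" % (cycles_b / CLOCK_HZ * 1e3))   # -> 2.40 ms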