
Introduction to Parallel Processing
Some of the content within these notes may be derived from “Computer Organization and Design, the
Hardware/Software Interface”, Fifth Edition, by David Patterson and John Hennessy
Objectives and Functions of OS
• Convenience
—Making the computer easier to use
• Efficiency
—Allowing better use of computer resources
Layers and Views of a Computer System
Operating System Services
• Program creation
• Program execution
• Access to I/O devices
• Controlled access to files
• System access
• Error detection and response
• Accounting
O/S as a Resource Manager
Desirable Hardware Features
• Memory protection
—To protect the Monitor
• Timer
—To prevent a job monopolizing the system
• Privileged instructions
—Only executed by Monitor
—e.g. I/O
• Interrupts
—Allows for relinquishing and regaining control
Multi-programmed Batch Systems
• I/O devices very slow
• When one program is waiting for I/O, another can use the CPU
Single Program
Multi-Programming with Two Programs
Multi-Programming with Three Programs
• Goal: connecting multiple computers to get higher performance
—Multiprocessors
—Scalability, availability, power efficiency
• Task-level (process-level) parallelism
—High throughput for independent jobs
• Parallel processing program
—Single program run on multiple processors
• Multicore microprocessors
—Chips with multiple processors
Hardware and Software
• Hardware
—Serial: e.g., single-core processors and microcontrollers
—Parallel: e.g., multicore processors
• Software
—Sequential: e.g., matrix multiplication
—Concurrent: e.g., operating system
• Sequential/concurrent software can run on serial/parallel hardware
• Challenge: making effective use of parallel hardware
Parallel Programming
• Parallel software is the problem
• Need to get significant performance improvement
—Otherwise, just use a faster uniprocessor, since it’s easier!
• Difficulties
—Partitioning
—Coordination
—Communications overhead
Amdahl’s Law
• Sequential part can limit speedup
• Example: 100 processors, 90× speedup?
  Tnew = Tparallelizable/100 + Tsequential
  Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
• Solving: Fparallelizable = 0.999
• Need sequential part to be 0.1% of original time
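A quick way to check these numbers is the short C sketch below. It is an addition to these notes, not part of the original slides, and the amdahl_speedup() helper is just an illustrative name.

#include <stdio.h>

/* Amdahl's Law: speedup on N processors when a fraction F of the
   original (single-processor) time is parallelizable. */
static double amdahl_speedup(double F, double N)
{
    return 1.0 / ((1.0 - F) + F / N);
}

int main(void)
{
    /* Solve 1/((1 - F) + F/100) = 90 for F:
       (1 - F) + F/100 = 1/90  =>  F = (1 - 1/90) / (1 - 1/100) */
    double F = (1.0 - 1.0 / 90.0) / (1.0 - 1.0 / 100.0);

    printf("required Fparallelizable = %.4f\n", F);                        /* ~0.999 */
    printf("resulting speedup        = %.1f\n", amdahl_speedup(F, 100.0)); /* 90.0   */
    return 0;
}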
Scaling Example
• Workload: sum of 10 scalars, and 10 × 10 matrix sum
• Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
—Time = 10 × tadd + 100/10 × tadd = 20 × tadd
—Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
—Time = 10 × tadd + 100/100 × tadd = 11 × tadd
—Speedup = 110/11 = 10 (10% of potential)
• Assumes load can be balanced across processors
Scaling Example (cont)
• What if matrix size is 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
—Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
—Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
—Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
—Speedup = 10010/110 = 91 (91% of potential)

If code can be decomposed to limit data set size on multi-core architectures, better-than-Amdahl performance can be achieved!
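The small C sketch below (an addition to the notes, not from the slides) reproduces the arithmetic of both scaling examples, modeling execution time as 10 serial adds plus the matrix adds split evenly across p processors.

#include <stdio.h>

/* Time in units of tadd: 10 serial scalar adds plus the matrix adds
   split evenly across p processors (perfect load balance assumed). */
static double time_units(double matrix_adds, int p)
{
    return 10.0 + matrix_adds / p;
}

int main(void)
{
    double sizes[] = { 100.0, 10000.0 };   /* 10x10 and 100x100 matrix sums */
    int    procs[] = { 10, 100 };

    for (int s = 0; s < 2; s++)
        for (int k = 0; k < 2; k++) {
            double speedup = time_units(sizes[s], 1) / time_units(sizes[s], procs[k]);
            printf("%5.0f matrix adds, %3d processors: speedup = %.1f (%2.0f%% of potential)\n",
                   sizes[s], procs[k], speedup, 100.0 * speedup / procs[k]);
        }
    return 0;
}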
Summary
• Parallel programming is hard. Doing it right is harder
• Limit sequential code
• Limit synchronization phases
• Amdahl’s law predicts the maximum speedup that can be expected by parallelization of your code
• Better-than-Amdahl performance can be realized by taking advantage of intelligent data decomposition
[ END ]
SER 450: Microprocessor Architecture

Multithreading
Multithreading
• Performing multiple threads of execution in parallel
—Replicate registers, PC, etc.
—Fast switching between threads
• Fine-grain multithreading
—Switch threads after each cycle
—Interleave instruction execution
—If one thread stalls, others are executed
• Coarse-grain multithreading
—Only switch on long stall (e.g., L2-cache miss)
—Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
Cache size and associativity must increase in proportion to the number of hardware threads to avoid misses caused by multithreading.

You have the power to set thread affinity to a specific core.
Setting Thread Affinity

The pthread_setaffinity_np() function sets the CPU affinity mask of the thread thread to the CPU set pointed to by cpuset. If the call is successful, and the thread is not currently running on one of the CPUs in cpuset, then it is migrated to one of those CPUs.
(from the Linux manual page)
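A minimal sketch of pinning the calling thread to core 0 with this call follows. It is not from the notes: pthread_setaffinity_np() is GNU-specific (compile with -pthread and _GNU_SOURCE), and the pin_to_core() helper is just an illustrative name.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to a single CPU core.
   Returns 0 on success or a non-zero error number, like other pthread calls. */
static int pin_to_core(int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);        /* start with an empty CPU set        */
    CPU_SET(core, &cpuset);   /* allow only the requested core      */
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

int main(void)
{
    int rc = pin_to_core(0);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
        return 1;
    }
    printf("thread is now restricted to core 0\n");
    return 0;
}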
[ END ]
SER 450: Microprocessor Architecture

Shared Memory Systems and Networking
Shared Memory
• SMP: shared memory multiprocessor
—Hardware provides single physical address space for all processors
—Synchronize shared variables using locks
• Memory access time
—UMA (uniform) vs. NUMA (nonuniform)
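To make the “synchronize shared variables using locks” point concrete, here is a small pthread sketch added to these notes (compile with -pthread): four threads increment one shared counter, and the mutex keeps the updates from racing.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                                  /* shared variable  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread updates at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 every run, thanks to the lock */
    return 0;
}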
Example: Sum Reduction
• Sum 100,000 numbers on a 100-processor UMA machine
—Each processor has ID: 0 ≤ Pn ≤ 99
—Partition 1000 numbers per processor
—Initial summation on each processor:
  sum[Pn] = 0;
  for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
• Now need to add these partial sums
—Reduction: divide and conquer
—Half the processors add pairs, then a quarter, and so on
—Need to synchronize between reduction steps
Example: Sum Reduction
half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2;  /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
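One way to turn this pseudocode into something runnable is OpenMP, which the notes do not use. The sketch below is an added illustration: it assumes the runtime actually grants all P threads (compile with -fopenmp) and uses a barrier in place of synch().

#include <omp.h>
#include <stdio.h>

#define N 100000
#define P 100

static double A[N];
static double sum[P];

int main(void)
{
    for (int i = 0; i < N; i++) A[i] = 1.0;     /* dummy data */

    #pragma omp parallel num_threads(P)         /* assumes all P threads are granted */
    {
        int Pn = omp_get_thread_num();

        /* initial summation: each thread sums its own 1000-element slice */
        sum[Pn] = 0;
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* tree reduction: half the threads add pairs each round */
        int half = P;
        do {
            #pragma omp barrier                 /* plays the role of synch()        */
            #pragma omp single
            if (half % 2 != 0)                  /* odd count: one thread picks up   */
                sum[0] += sum[half - 1];        /* the unpaired element             */
            half = half / 2;                    /* dividing line on who sums        */
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half != 1);
    }

    printf("total = %.0f\n", sum[0]);           /* 100000 with the dummy data */
    return 0;
}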
Message Passing
• Each processor has private physical address space
• Hardware sends/receives messages between processors
Loosely Coupled Clusters
• Network of independent computers
—Each has private memory and OS
—Connected using I/O system
—E.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
—Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
—Administration cost (prefer virtual machines)
—Low interconnect bandwidth
—cf. processor/memory bandwidth on an SMP
Sum Reduction (Again)
• Sum 100,000 on 100 processors
—First distribute 1000 numbers to each
—Then do partial sums:
  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
      sum = sum + AN[i];
• Reduction
—Half the processors send, other half receive and add
—Then a quarter send, a quarter receive and add, …
Sum Reduction (Again)
• Given send() and receive() operations:
  limit = 100; half = 100;  /* 100 processors */
  repeat
      half = (half+1)/2;    /* send vs. receive dividing line */
      if (Pn >= half && Pn < limit)
          send(Pn - half, sum);
      if (Pn < (limit/2))
          sum = sum + receive();
      limit = half;         /* upper limit of senders */
  until (half == 1);        /* exit with final sum */
• Send/receive also provide synchronization
• Assumes send/receive take similar time to addition
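The send()/receive() pseudocode maps naturally onto MPI, which the notes do not introduce. The sketch below is one possible rendering added here for illustration (run with e.g. mpirun -np 100); each rank sums dummy data in place of its 1000 distributed numbers.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* partial sum over this rank's private slice (dummy data here) */
    double sum = 0.0;
    for (int i = 0; i < 1000; i++)
        sum += 1.0;

    /* tree reduction: upper half of the remaining ranks send, lower half receive */
    int limit = nprocs, half = nprocs;
    do {
        half = (half + 1) / 2;                  /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, Pn + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += partial;                     /* receive also acts as synchronization */
        }
        limit = half;                           /* upper limit of senders */
    } while (half != 1);

    if (Pn == 0)
        printf("total = %.0f\n", sum);          /* final sum ends up on rank 0 */

    MPI_Finalize();
    return 0;
}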
Network Characteristics
• Performance
—Latency per message (unloaded network)
—Throughput: link bandwidth, total network bandwidth, bisection bandwidth
—Congestion delays (depending on traffic)
• Cost
• Power
• Routability in silicon
Difference between latency and bandwidth
Sources of Latency
• Switching Latency
—Caused by switching/routing hardware
—In the range of 100s of nanoseconds
• Protocol Overhead
—Every packet sent over the network has a small amount of code that must be executed before the packet is sent
—Typical packet latencies range between 1 uSec and 100 mSec depending on the protocol
• Transmission latency
—The time it takes to transmit all the bytes in the packet over the network
—Impacted by packet size and network bandwidth
Scalability and Networking – It’s about data

• 1,000,000,000 bytes to process at 10 ns per byte
• Zero switching latency
• 5 uSec protocol overhead/latency per packet
• 100M bytes/second network capability to/from each node
• Maximum packet size = 1000 bytes of payload
• Every processor must share 1/nth of its data with each of the other processors
Scalability and Networking – It’s about data

                   1 Processor               4 Processors                          1000 Processors
Computation        1e9 × 10e-9 = 10 seconds  ¼ × 10 = 2.5 seconds                  (1/1000) × 10 = 0.01 seconds
Protocol Overhead  0                         5e-6 × ¾ × 1e9 / 1000 = 3.75 seconds  5e-6 × (999/1000) × 1e9 / 1000 ≈ 5 seconds
Transmission Time  0                         ¾ × 1e9 / 100e6 = 7.5 seconds         (999/1000) × 1e9 / 100e6 = 9.99 seconds
Total              10 seconds                13.75 seconds                          ≈ 15 seconds
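The short C sketch below is an addition to these notes; it encodes the cost model the table appears to use: compute time splits 1/n across processors, while (n−1)/n of the data must cross the network in 1000-byte packets, each costing 5 uSec of protocol overhead plus transmission time at 100M bytes/second per node.

#include <stdio.h>

int main(void)
{
    const double bytes        = 1e9;     /* total data set to process       */
    const double ns_per_byte  = 10e-9;   /* computation cost per byte       */
    const double pkt_payload  = 1000.0;  /* maximum packet payload (bytes)  */
    const double pkt_overhead = 5e-6;    /* protocol overhead per packet    */
    const double link_bw      = 100e6;   /* bytes/second to/from each node  */

    int counts[] = { 1, 4, 1000 };
    for (int k = 0; k < 3; k++) {
        double n        = counts[k];
        double shared   = (n - 1.0) / n * bytes;   /* data crossing the network */
        double compute  = bytes * ns_per_byte / n;
        double protocol = pkt_overhead * shared / pkt_payload;
        double transmit = shared / link_bw;
        printf("%4d processors: %.2f + %.2f + %.2f = %.2f seconds\n",
               counts[k], compute, protocol, transmit,
               compute + protocol + transmit);
    }
    return 0;
}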
Summary
• Shared memory and networking can be used to scale performance through parallelism
• Data decomposition is key to performance scalability
[ END ]
