Introduction to Parallel Processing
Some of the content within these notes may be derived from “Computer Organization and Design, the
Hardware/Software Interface”, Fifth Edition, by David Patterson and John Hennessy
Objectives and Functions of OS
• Convenience
—Making the computer easier to use
• Efficiency
—Allowing better use of computer resources
Layers and Views of a Computer System
Operating System Services
• Program creation
• Program execution
• Access to I/O devices
• Controlled access to files
• System access
• Error detection and response
• Accounting
O/S as a Resource Manager
Desirable Hardware Features
• Memory protection
—To protect the Monitor
• Timer
—To prevent a job monopolizing the system
• Privileged instructions
—Only executed by Monitor
—e.g. I/O
• Interrupts
—Allows for relinquishing and regaining control
Multi-programmed Batch Systems
• I/O devices very slow
• When one program is waiting for I/O, another can use the CPU
Single Program
Multi-Programming with Two Programs
Multi-Programming with Three Programs
• Goal: connecting multiple computers to get higher performance
• Multiprocessors
• Scalability, availability, power efficiency
• Task-level (process-level) parallelism
• High throughput for independent jobs
• Parallel processing program
• Single program run on multiple processors
• Multicore microprocessors
• Chips with multiple processors
Hardware and Software
• Hardware
• Serial: e.g., single-core processors and microcontrollers
• Parallel: e.g., multicore processors
• Software
• Sequential: e.g., matrix multiplication
• Concurrent: e.g., operating system
• Sequential/concurrent software can run on serial/parallel hardware
• Challenge: making effective use of parallel hardware
Parallel Programming
• Parallel software is the problem
• Need to get significant performance improvement
• Otherwise, just use a faster uniprocessor, since it’s easier!
• Difficulties
• Partitioning
• Coordination
• Communications overhead
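The three difficulties above can be made concrete with a small sketch: partitioning data across workers, coordinating them through a thread pool, and paying a combining (communication) step at the end. The worker function and the 4-worker split are illustrative assumptions, not part of the original slides.

```python
# Hedged sketch: partitioning, coordination, and communication overhead
# in a parallel sum. The data set and worker count are arbitrary examples.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Each worker independently sums its partition of the data."""
    return sum(chunk)

data = list(range(1, 101))              # 1..100
n_workers = 4
chunk_size = len(data) // n_workers     # partitioning the work
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Coordination: a pool dispatches one partition per worker.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(partial_sum, chunks))

# Communication overhead: partial results must be combined serially.
total = sum(partials)
print(total)  # 5050
```

The final combining step is the kind of sequential work that Amdahl's law (next slide) shows can limit overall speedup.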
Amdahl’s Law
• Sequential part can limit speedup
• Example: 100 processors, 90× speedup?
• Tnew = Tparallelizable/100 + Tsequential
• Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
• Solving: Fparallelizable = 0.999
• Need sequential part to be 0.1% of original time
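The algebra in this example can be checked with a short sketch. The function names are illustrative; the formula is Amdahl's law as given above, solved for the parallelizable fraction F.

```python
# Amdahl's Law: Speedup = 1 / ((1 - F) + F/n) for parallel fraction F
# on n processors. Solve for the F needed to hit a target speedup.

def speedup(f_par, n_proc):
    """Amdahl's Law speedup for parallelizable fraction f_par."""
    return 1.0 / ((1.0 - f_par) + f_par / n_proc)

def required_fraction(target, n_proc):
    """Rearranged algebraically: F = (1 - 1/target) / (1 - 1/n)."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / n_proc)

f = required_fraction(90, 100)
print(round(f, 3))                 # 0.999 -> sequential part ~0.1%
print(round(speedup(f, 100), 1))   # 90.0
```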
Scaling Example
• Workload: sum of 10 scalars, and 10 × 10 matrix sum
• Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
• Time = 10 × tadd + 100/10 × tadd = 20 × tadd
• Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
• Time = 10 × tadd + 100/100 × tadd = 11 × tadd
• Speedup = 110/11 = 10 (10% of potential)
• Assumes load can be balanced across processors
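The arithmetic above can be sketched directly: the 10 scalar adds stay serial while only the 100 matrix adds are divided among processors. The helper name is an illustrative assumption.

```python
# Scaling example: time in units of t_add, assuming perfect load balance.
# Only the matrix adds are parallelized; the scalar adds remain serial.

def time_units(scalar_adds, matrix_adds, n_proc):
    """Total execution time in multiples of t_add."""
    return scalar_adds + matrix_adds / n_proc

t1   = time_units(10, 100, 1)    # 110
t10  = time_units(10, 100, 10)   # 20
t100 = time_units(10, 100, 100)  # 11
print(t1 / t10)   # 5.5  (55% of the ideal 10x)
print(t1 / t100)  # 10.0 (10% of the ideal 100x)
```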
Scaling Example (cont)
• What if matrix size is 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
• Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
• Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
• Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
• Speedup = 10010/110 = 91 (91% of potential)
If code can be decomposed to limit data set size on multi-core architectures, better-than-Amdahl performance can be achieved!
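The larger-matrix case uses the same model as the previous slide, just with 10,000 matrix adds; the sketch below confirms that scaling the problem size recovers most of the potential speedup.

```python
# Same scaling model with a 100x100 matrix (10,000 adds), assuming
# perfect load balance across processors.

def time_units(scalar_adds, matrix_adds, n_proc):
    """Total execution time in multiples of t_add."""
    return scalar_adds + matrix_adds / n_proc

t1   = time_units(10, 10000, 1)    # 10010
t10  = time_units(10, 10000, 10)   # 1010
t100 = time_units(10, 10000, 100)  # 110
print(round(t1 / t10, 1))   # 9.9  (99% of the ideal 10x)
print(t1 / t100)            # 91.0 (91% of the ideal 100x)
```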
Summary
• Parallel programming is hard. Doing it right is harder.
• Limit sequential code
• Limit synchronization phases
• Amdahl’s law predicts the maximum speedup that can be expected from parallelizing your code
• Better-than-Amdahl performance can be realized by taking advantage of intelligent data decomposition
SER 450: Microprocessor Architecture
Multithreading
Multithreading
• Performing multiple threads of execution in parallel
• Replicate registers, PC, etc.
• Fast switching between threads
• Fine-grain multithreading
• Switch threads after each cycle
• Interleave instruction execution
• If one thread stalls, others are executed
• Coarse-grain multithreading
• Only switch on long stall (e.g., L2-cache miss)
• Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
Cache size and associativity must increase in proportion to the number of hardware threads to avoid misses caused by multithreading.
You have the power to set thread affinity to a specific core.
Setting Thread Affinity
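One way to pin execution to a specific core, sketched below for Linux using Python's `os.sched_setaffinity`. The choice of core 0 is an arbitrary example; on other operating systems a different API (e.g., `pthread_setaffinity_np` in C) would be needed.

```python
# Hedged sketch: pin the current process to CPU core 0 (Linux only).
import os

if hasattr(os, "sched_setaffinity"):
    # pid 0 means "the calling process"; {0} restricts it to core 0.
    os.sched_setaffinity(0, {0})
    print(os.sched_getaffinity(0))  # {0}
```

Pinning a thread or process keeps its working set warm in one core's caches, which complements the cache-sizing note above.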