
Introduction to Parallel Processing
Some of the content within these notes may be derived from “Computer Organization and Design, the
Hardware/Software Interface”, Fifth Edition, by David Patterson and John Hennessy
Objectives and Functions of OS
• Convenience
—Making the computer easier to use
• Efficiency
—Allowing better use of computer resources
Layers and Views of a Computer System
Operating System Services
• Program creation
• Program execution
• Access to I/O devices
• Controlled access to files
• System access
• Error detection and response
• Accounting
O/S as a Resource Manager
Desirable Hardware Features
• Memory protection
—To protect the Monitor
• Timer
—To prevent a job monopolizing the system
• Privileged instructions
—Only executed by Monitor
—e.g. I/O
• Interrupts
—Allows for relinquishing and regaining control
Multi-programmed Batch Systems
• I/O devices very slow
• When one program is waiting for I/O, another can use the CPU
Single Program
Multi-Programming with Two Programs
Multi-Programming with Three Programs
• Goal: connecting multiple computers to get higher performance
—Multiprocessors
—Scalability, availability, power efficiency
• Task-level (process-level) parallelism
—High throughput for independent jobs
• Parallel processing program
—Single program run on multiple processors
• Multicore microprocessors
—Chips with multiple processors
Hardware and Software
• Hardware
—Serial: e.g., single-core processors and microcontrollers
—Parallel: e.g., multicore processors
• Software
—Sequential: e.g., matrix multiplication
—Concurrent: e.g., operating system
• Sequential/concurrent software can run on serial/parallel hardware
• Challenge: making effective use of parallel hardware
Parallel Programming
• Parallel software is the problem
• Need to get significant performance improvement
—Otherwise, just use a faster uniprocessor, since it’s easier!
• Difficulties
—Partitioning
—Coordination
—Communications overhead
Amdahl’s Law
• Sequential part can limit speedup
• Example: 100 processors, 90× speedup?
  Tnew = Tparallelizable/100 + Tsequential
  Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
• Solving: Fparallelizable = 0.999
• Need sequential part to be 0.1% of original time
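A quick way to check these numbers is the short C sketch below. It is an addition to these notes, not part of the original slides, and the amdahl_speedup() helper is just an illustrative name.

#include <stdio.h>

/* Amdahl's Law: speedup on N processors when a fraction F of the
   original (single-processor) time is parallelizable. */
static double amdahl_speedup(double F, double N)
{
    return 1.0 / ((1.0 - F) + F / N);
}

int main(void)
{
    /* Solve 1/((1 - F) + F/100) = 90 for F:
       (1 - F) + F/100 = 1/90  =>  F = (1 - 1/90) / (1 - 1/100) */
    double F = (1.0 - 1.0 / 90.0) / (1.0 - 1.0 / 100.0);

    printf("required Fparallelizable = %.4f\n", F);                        /* ~0.999 */
    printf("resulting speedup        = %.1f\n", amdahl_speedup(F, 100.0)); /* 90.0   */
    return 0;
}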
Scaling Example
• Workload: sum of 10 scalars, and 10 × 10 matrix sum
• Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
—Time = 10 × tadd + 100/10 × tadd = 20 × tadd
—Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
—Time = 10 × tadd + 100/100 × tadd = 11 × tadd
—Speedup = 110/11 = 10 (10% of potential)
• Assumes load can be balanced across processors
Scaling Example (cont)
• What if matrix size is 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
—Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
—Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
—Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
—Speedup = 10010/110 = 91 (91% of potential)

If code can be decomposed to limit data set size on multi-core architectures, better-than-Amdahl performance can be achieved!
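The small C sketch below (an addition to the notes, not from the slides) reproduces the arithmetic of both scaling examples, modeling execution time as 10 serial adds plus the matrix adds split evenly across p processors.

#include <stdio.h>

/* Time in units of tadd: 10 serial scalar adds plus the matrix adds
   split evenly across p processors (perfect load balance assumed). */
static double time_units(double matrix_adds, int p)
{
    return 10.0 + matrix_adds / p;
}

int main(void)
{
    double sizes[] = { 100.0, 10000.0 };   /* 10x10 and 100x100 matrix sums */
    int    procs[] = { 10, 100 };

    for (int s = 0; s < 2; s++)
        for (int k = 0; k < 2; k++) {
            double speedup = time_units(sizes[s], 1) / time_units(sizes[s], procs[k]);
            printf("%5.0f matrix adds, %3d processors: speedup = %.1f (%2.0f%% of potential)\n",
                   sizes[s], procs[k], speedup, 100.0 * speedup / procs[k]);
        }
    return 0;
}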
Summary
• Parallel programming is hard. Doing it right is harder
• Limit sequential code
• Limit synchronization phases
• Amdahl’s law predicts the maximum speedup that can be expected by parallelization of your code
• Better-than-Amdahl performance can be realized by taking advantage of intelligent data decomposition
[ END ]
SER 450: Microprocessor Architecture

Multithreading
Multithreading
• Performing multiple threads of execution in parallel
—Replicate registers, PC, etc.
—Fast switching between threads
• Fine-grain multithreading
—Switch threads after each cycle
—Interleave instruction execution
—If one thread stalls, others are executed
• Coarse-grain multithreading
—Only switch on long stall (e.g., L2-cache miss)
—Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
Cache size and associativity must increase in proportion to the number of hardware threads to avoid misses caused by multithreading.

You have the power to set thread affinity to a specific core.
Setting Thread Affinity

The pthread_setaffinity_np() function sets the CPU affinity mask of the thread thread to the CPU set pointed to by cpuset. If the call is successful, and the thread is not currently running on one of the CPUs in cpuset, then it is migrated to one of those CPUs.
(from the Linux manual page)
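A minimal sketch of pinning the calling thread to core 0 with this call follows. It is not from the notes: pthread_setaffinity_np() is GNU-specific (compile with -pthread and _GNU_SOURCE), and the pin_to_core() helper is just an illustrative name.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to a single CPU core.
   Returns 0 on success or a non-zero error number, like other pthread calls. */
static int pin_to_core(int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);        /* start with an empty CPU set        */
    CPU_SET(core, &cpuset);   /* allow only the requested core      */
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

int main(void)
{
    int rc = pin_to_core(0);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
        return 1;
    }
    printf("thread is now restricted to core 0\n");
    return 0;
}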
[ END ]
SER 450: Microprocessor Architecture

Shared Memory Systems and Networking
Shared Memory
• SMP: shared memory multiprocessor
—Hardware provides single physical address space for all processors
—Synchronize shared variables using locks
• Memory access time
—UMA (uniform) vs. NUMA (nonuniform)
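To make the “synchronize shared variables using locks” point concrete, here is a small pthread sketch added to these notes (compile with -pthread): four threads increment one shared counter, and the mutex keeps the updates from racing.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                                  /* shared variable  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread updates at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 every run, thanks to the lock */
    return 0;
}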
Example: Sum Reduction
• Sum 100,000 numbers on a 100-processor UMA machine
—Each processor has ID: 0 ≤ Pn ≤ 99
—Partition 1000 numbers per processor
—Initial summation on each processor:
  sum[Pn] = 0;
  for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
• Now need to add these partial sums
—Reduction: divide and conquer
—Half the processors add pairs, then a quarter, and so on
—Need to synchronize between reduction steps
Example: Sum Reduction
half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2;  /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
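One way to turn this pseudocode into something runnable is OpenMP, which the notes do not use. The sketch below is an added illustration: it assumes the runtime actually grants all P threads (compile with -fopenmp) and uses a barrier in place of synch().

#include <omp.h>
#include <stdio.h>

#define N 100000
#define P 100

static double A[N];
static double sum[P];

int main(void)
{
    for (int i = 0; i < N; i++) A[i] = 1.0;     /* dummy data */

    #pragma omp parallel num_threads(P)         /* assumes all P threads are granted */
    {
        int Pn = omp_get_thread_num();

        /* initial summation: each thread sums its own 1000-element slice */
        sum[Pn] = 0;
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* tree reduction: half the threads add pairs each round */
        int half = P;
        do {
            #pragma omp barrier                 /* plays the role of synch()        */
            #pragma omp single
            if (half % 2 != 0)                  /* odd count: one thread picks up   */
                sum[0] += sum[half - 1];        /* the unpaired element             */
            half = half / 2;                    /* dividing line on who sums        */
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half != 1);
    }

    printf("total = %.0f\n", sum[0]);           /* 100000 with the dummy data */
    return 0;
}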
Message Passing
• Each processor has private physical address space
• Hardware sends/receives messages between processors
Loosely Coupled Clusters
• Network of independent computers
—Each has private memory and OS
—Connected using I/O system
—E.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
—Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
—Administration cost (prefer virtual machines)
—Low interconnect bandwidth
—cf. processor/memory bandwidth on an SMP
Sum Reduction (Again)
• Sum 100,000 on 100 processors
—First distribute 1000 numbers to each
—Then do partial sums:
  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
      sum = sum + AN[i];
• Reduction
—Half the processors send, other half receive and add
—Then a quarter send, a quarter receive and add, …
Sum Reduction (Again)
• Given send() and receive() operations:
  limit = 100; half = 100;  /* 100 processors */
  repeat
      half = (half+1)/2;    /* send vs. receive dividing line */
      if (Pn >= half && Pn < limit)
          send(Pn - half, sum);
      if (Pn < (limit/2))
          sum = sum + receive();
      limit = half;         /* upper limit of senders */
  until (half == 1);        /* exit with final sum */
• Send/receive also provide synchronization
• Assumes send/receive take similar time to addition
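The send()/receive() pseudocode maps naturally onto MPI, which the notes do not introduce. The sketch below is one possible rendering added here for illustration (run with e.g. mpirun -np 100); each rank sums dummy data in place of its 1000 distributed numbers.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* partial sum over this rank's private slice (dummy data here) */
    double sum = 0.0;
    for (int i = 0; i < 1000; i++)
        sum += 1.0;

    /* tree reduction: upper half of the remaining ranks send, lower half receive */
    int limit = nprocs, half = nprocs;
    do {
        half = (half + 1) / 2;                  /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, Pn + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += partial;                     /* receive also acts as synchronization */
        }
        limit = half;                           /* upper limit of senders */
    } while (half != 1);

    if (Pn == 0)
        printf("total = %.0f\n", sum);          /* final sum ends up on rank 0 */

    MPI_Finalize();
    return 0;
}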
Network Characteristics
• Performance
—Latency per message (unloaded network)
—Throughput: link bandwidth, total network bandwidth, bisection bandwidth
—Congestion delays (depending on traffic)
• Cost
• Power
• Routability in silicon
Difference between latency and bandwidth
Sources of Latency
• Switching Latency
—Caused by switching/routing hardware
—In the range of 100s of nanoseconds
• Protocol Overhead
—Every packet sent over the network has a small amount of code that must be executed before the packet is sent
—Typical packet latencies range between 1 uSec and 100 mSec depending on the protocol
• Transmission latency
—The time it takes to transmit all the bytes in the packet over the network
—Impacted by packet size and network bandwidth
Scalability and Networking – It’s about data

• 1,000,000,000 bytes to process at 10 ns per byte
• Zero switching latency
• 5 uSec protocol overhead/latency per packet
• 100M bytes/second network capability to/from each node
• Maximum packet size = 1000 bytes of payload
• Every processor must share 1/nth of its data with each of the other processors
Scalability and Networking – It’s about data

                   1 Processor               4 Processors                          1000 Processors
Computation        1e9 × 10e-9 = 10 seconds  ¼ × 10 = 2.5 seconds                  (1/1000) × 10 = 0.01 seconds
Protocol Overhead  0                         5e-6 × ¾ × 1e9 / 1000 = 3.75 seconds  5e-6 × (999/1000) × 1e9 / 1000 ≈ 5 seconds
Transmission Time  0                         ¾ × 1e9 / 100e6 = 7.5 seconds         (999/1000) × 1e9 / 100e6 = 9.99 seconds
Total              10 seconds                13.75 seconds                          ≈ 15 seconds
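The short C sketch below is an addition to these notes; it encodes the cost model the table appears to use: compute time splits 1/n across processors, while (n−1)/n of the data must cross the network in 1000-byte packets, each costing 5 uSec of protocol overhead plus transmission time at 100M bytes/second per node.

#include <stdio.h>

int main(void)
{
    const double bytes        = 1e9;     /* total data set to process       */
    const double ns_per_byte  = 10e-9;   /* computation cost per byte       */
    const double pkt_payload  = 1000.0;  /* maximum packet payload (bytes)  */
    const double pkt_overhead = 5e-6;    /* protocol overhead per packet    */
    const double link_bw      = 100e6;   /* bytes/second to/from each node  */

    int counts[] = { 1, 4, 1000 };
    for (int k = 0; k < 3; k++) {
        double n        = counts[k];
        double shared   = (n - 1.0) / n * bytes;   /* data crossing the network */
        double compute  = bytes * ns_per_byte / n;
        double protocol = pkt_overhead * shared / pkt_payload;
        double transmit = shared / link_bw;
        printf("%4d processors: %.2f + %.2f + %.2f = %.2f seconds\n",
               counts[k], compute, protocol, transmit,
               compute + protocol + transmit);
    }
    return 0;
}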
Summary
• Shared memory and networking can be used to scale performance through parallelism
• Data decomposition is key to performance scalability
[ END ]
