Advanced Computer Architecture, Fall 2019
Multithreaded Architectures
(Applied Parallel Programming)
Lecture 1: Introduction
Multicore processors
• The benefit of having multiple cores is that the system can handle multiple threads simultaneously
• Each core can process a separate stream of data
• This architecture greatly increases the performance of a system running parallel applications
SMT ─ Simultaneous Multithreading
CMP ─ Chip Multiprocessor
Multithreading cores
• A small number of threads
• Limited parallel processing capability
• Each thread runs a separate program
Massive Multithreading
• Very large number of threads (thousands)
• Massive parallelism
• SPMD model: Single Program Multiple Data
– Also called Data Parallelism
• Two kinds of threads:
– H/W threads ─ very fast switching (state kept in registers)
– S/W threads ─ slower switching (thousands of clock cycles)
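The SPMD model can be sketched as a minimal CUDA program (not from the course slides; the kernel name and values are illustrative): every thread executes the same kernel code, but each computes its own global index and therefore operates on different data. Launching far more threads than cores is what lets the hardware hide memory latency by fast switching.

```cuda
#include <cstdio>

// SPMD sketch: one program, many threads, each touching different data.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per-thread index
    if (i < n)                                      // guard: grid may overshoot n
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                          // ~one million elements
    float *d;
    cudaMallocManaged(&d, n * sizeof(float));       // memory visible to CPU and GPU
    for (int i = 0; i < n; ++i) d[i] = 1.0f;

    // Launch thousands of H/W threads: 4096 blocks of 256 threads each.
    scaleKernel<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();                        // wait for the GPU to finish

    printf("d[0] = %f\n", d[0]);
    cudaFree(d);
    return 0;
}
```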
Course Goals
• Learn to program massively parallel processors and achieve high performance
• Technical subjects
– principles of parallel algorithms
– processor architecture features and constraints
– programming techniques
Course Outline
1. Introduction
2. Data Parallel Computing
3. Scalable Parallel Execution
4. Memory and Data Locality
5. Performance Considerations
6. Numerical considerations
7. Parallel patterns: convolution
8. Parallel patterns: prefix sum
9. Parallel patterns: parallel histogram computation
10. Parallel patterns: sparse matrix computation
11. Parallel patterns: merge sort
12. Parallel patterns: graph search
13. Parallel processing with DSPs (and/or selected papers if time allows)
People
• Instructor:
Prof. Shlomo Weiss
Office hours: by appointment
Office: Room 143, EE Maabadot building, 1st floor
Mail: [email protected]
Web Resources
• Website: moodle
– Lecture slides (see notes for more details)
– Homework
– HW solution
– Forum for Q&A - your classmates often have answers
Grading
• Attendance mandatory!
• Exam: 85%
• Weekly homework: 15% (~1.7% per homework)
• Two weeks to upload your solution to moodle (check deadlines)
• Make sure to click on the submission button (otherwise it is a draft)
• Late homework will not be accepted!
• Submissions must be printed (handwritten homework is not acceptable)
• Homework must be done individually
Hands-on Homework
Three choices to run your programs on a CUDA device:
1. PCs available in the following classrooms:
– Binyan Tochna classrooms: 010, 001
– Binyan Kitot classroom: 003
– Visual Studio and the NVIDIA SDK are installed on these PCs.
2. If you have a CUDA device on your PC or laptop, you may do your programming assignments at home. You'll need Visual Studio.
3. https://medium.com/@iphoenix179/running-cuda-c-c-in-jupyter-or-how-to-run-nvcc-in-google-colab-663d33f53772
Getting started on CUDA
For your first homework assignment:
• Open https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html
• Follow the instructions in the cuda-quick-start-guide
• Select the “nbody” sample from those provided with the
CUDA installation.
• Run nbody.exe.
• Hand in a screenshot of “nbody” execution.
Academic Honesty
• You are allowed and encouraged to discuss assignments
with other students in the class. Getting verbal advice/help
from people who’ve already taken the course is also fine.
• Any reference to assignments from previous terms or web
postings is unacceptable
• Any copying of non-trivial code is unacceptable
– Non-trivial = more than a line or so
– Includes reading someone else’s code and then going off to write
your own.
Academic Honesty (cont.)
• Giving/receiving help on an exam is unacceptable
Text
1. D. Kirk and W. Hwu, "Programming Massively Parallel Processors – A Hands-on Approach," Morgan Kaufmann Publishers, 3rd edition, 2016
Biomolecular simulation: a computational
microscope for molecular biology
• Large clusters (scale out) allow simulation of biological systems of
realistic space dimensions
– Interesting biological systems have dimensions of mm or larger
– Thousands of nodes are required to hold and update all the grid points.
• Fast nodes (scale up) allow simulation at realistic time scales
– Simulation time steps at the femtosecond (10⁻¹⁵ second) level are needed for accuracy
– Biological processes take milliseconds or longer
– Current molecular dynamics simulations progress at about one day of computation for each 10–100 microseconds of the simulated process.
Blue Waters Breakthrough ─ Computational Biology
• Determination of the structure of the HIV capsid at the atomic level.
• Collaborative effort of experimental groups at the U. of Pittsburgh and Vanderbilt U., and Schulten's computational team at the U. of Illinois.
• 64-million-atom HIV capsid simulation of the process through which the capsid disassembles, releasing its genetic material ─ a critical step in understanding HIV infection and finding a target for antiviral drugs.
Blue Waters Computing System
Operational at Illinois since 3/2013: 49,504 CPUs ─ 4,224 GPUs
Frequency Scaled Too Fast 1993–2003
[chart: clock frequency on a log scale, 1985–2005]
Total Processor Power Increased (super-scaling of frequency and chip size)
[chart: total processor power on a log scale, 1985–2003]
CPUs: Latency-Oriented Design
[diagram: CPU chip with control logic, cache, a few powerful ALUs, and DRAM]
• High clock frequency
• Large caches
– Convert long-latency memory accesses to short-latency cache accesses
• Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
• Powerful ALUs
– Reduced operation latency
GPUs: Throughput-Oriented Design
[diagram: GPU chip with many simple cores and DRAM]
• Moderate clock frequency
• Small caches
– To boost memory throughput
• Simple control
– No branch prediction
– No data forwarding
• Energy-efficient ALUs
– Many; long latency but heavily pipelined for high throughput
• Requires a massive number of threads to tolerate latencies
Winning Strategies Use Both CPU and GPU
• CPUs for sequential parts where latency matters
– CPUs can be 10+× faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+× faster than CPUs for parallel code
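A minimal CUDA sketch of this division of labor (not from the slides; the kernel and sizes are illustrative): the CPU handles the sequential parts ─ allocation, initialization, checking ─ while the GPU handles the throughput-bound parallel loop.

```cuda
#include <cstdio>

// GPU: the parallel part. Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 16;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    // CPU: sequential setup, where latency matters.
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    // GPU: parallel part, where throughput wins.
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    // CPU: sequential result check.
    printf("c[10] = %f\n", c[10]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```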
Heterogeneous parallel computing is catching on.
• Financial Analysis
• Scientific Simulation
• Engineering Simulation
• Medical Imaging
• Data-Intensive Analytics
• Digital Audio Processing
• Digital Video Processing
• Computer Vision
• Biomedical Informatics
• Electronic Design Automation
© David Kirk/NVIDIA and Wen-mei W. Hwu
Load Balance
• The total amount of time to complete a parallel job is limited by the thread that takes the longest to finish
[figure: good vs. bad load balance across threads]
• Massively parallel execution cannot afford serialization
• Contention in accessing critical data causes serialization
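The contention point can be illustrated with a small CUDA sketch (hypothetical kernels, not from the slides): when every thread atomically updates a single counter, the hardware must serialize those updates; combining partial results per block in shared memory first reduces the contention on global memory.

```cuda
// High contention: all threads in the grid update ONE global counter,
// so the atomic updates are serialized at a single memory word.
__global__ void countAllInOne(const int *v, int n, int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] > 0)
        atomicAdd(count, 1);       // every matching thread contends here
}

// Lower contention: each block first accumulates in fast shared memory,
// then issues a single global atomic per block.
__global__ void countPerBlock(const int *v, int n, int *count) {
    __shared__ int local;          // one counter per block
    if (threadIdx.x == 0) local = 0;
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] > 0)
        atomicAdd(&local, 1);      // contention limited to one block
    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(count, local);   // one global atomic per block
}
```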
Summary ─ Massively Parallel Processing
• Applications must have parallelism
• Balanced load
• High bandwidth to memory
• Few conflicts (and little serialization)