Advanced Computer Architecture, Fall 2019
Multithreaded Architectures
(Applied Parallel Programming)
Lecture 1: Introduction
Multicore processors
• The benefit of having multiple cores is that the system can handle multiple threads simultaneously
• Each core can process a separate stream of data
• This architecture greatly increases the performance of a system running parallel applications
SMT ─ Simultaneous Multithreading
CMP ─ Chip Multiprocessor
Multithreading cores
• A small number of threads
• Limited parallel processing capability
• Each thread runs a separate program
Massive Multithreading
• Very large number of threads (thousands)
• Massive parallelism
• SPMD model: Single Program Multiple Data
– Also called Data Parallelism
• Two kinds of threads:
– H/W threads ─ very fast switching (state kept in registers)
– S/W threads ─ slower switching (thousands of clock cycles)
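The SPMD model can be sketched as a minimal CUDA program (not from the course slides; the kernel name and values are illustrative): every thread executes the same kernel code, but each computes its own global index and therefore operates on different data. Launching far more threads than cores is what lets the hardware hide memory latency by fast switching.

```cuda
#include <cstdio>

// SPMD sketch: one program, many threads, each touching different data.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per-thread index
    if (i < n)                                      // guard: grid may overshoot n
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                          // ~one million elements
    float *d;
    cudaMallocManaged(&d, n * sizeof(float));       // memory visible to CPU and GPU
    for (int i = 0; i < n; ++i) d[i] = 1.0f;

    // Launch thousands of H/W threads: 4096 blocks of 256 threads each.
    scaleKernel<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();                        // wait for the GPU to finish

    printf("d[0] = %f\n", d[0]);
    cudaFree(d);
    return 0;
}
```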
Course Goals
• Learn to program massively parallel processors and achieve high performance
• Technical subjects
– principles of parallel algorithms
– processor architecture features and constraints
– programming techniques
Course Outline
1. Introduction
2. Data Parallel Computing
3. Scalable Parallel Execution
4. Memory and Data Locality
5. Performance Considerations
6. Numerical considerations
7. Parallel patterns: convolution
8. Parallel patterns: prefix sum
9. Parallel patterns: parallel histogram computation
10. Parallel patterns: sparse matrix computation
11. Parallel patterns: merge sort
12. Parallel patterns: graph search
13. Parallel processing with DSPs (and/or selected papers if time allows)
People
• Instructor:
Prof. Shlomo Weiss
Office hours: by appointment
Office: Room 143, EE Maabadot building, 1st floor
Mail: [email protected]
Web Resources
• Website: moodle
– Lecture slides (see notes for more details)
– Homework
– HW solution
– Forum for Q&A - your classmates often have answers
Grading
• Attendance mandatory!
• Exam: 85%
• Weekly homework: 15% (~1.7% per homework)
• Two weeks to upload your solution to moodle (check deadlines)
• Make sure to click on the submission button (otherwise it is a draft)
• Late homework will not be accepted!
• Submissions must be printed (handwritten homework is not acceptable)
• Homework must be done individually
Hands-on Homework
Three choices to run your programs on a CUDA device:
1. PCs available in the following classrooms:
– Binyan Tochna classrooms: 010, 001
– Binyan Kitot classroom: 003
– Visual Studio and the NVIDIA SDK are installed on these PCs.
2. If you have a CUDA device on your PC or laptop, you may do your programming assignments at home. You'll need Visual Studio.
3. https://medium.com/@iphoenix179/running-cuda-c-c-in-jupyter-or-how-to-run-nvcc-in-google-colab-663d33f53772
Getting started on CUDA
For your first homework assignment:
• Open https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html
• Follow the instructions in the cuda-quick-start-guide
• Select the “nbody” sample from those provided with the
CUDA installation.
• Run nbody.exe.
• Hand in a screenshot of “nbody” execution.
Academic Honesty
• You are allowed and encouraged to discuss assignments
with other students in the class. Getting verbal advice/help
from people who’ve already taken the course is also fine.
• Any reference to assignments from previous terms or web
postings is unacceptable
• Any copying of non-trivial code is unacceptable
– Non-trivial = more than a line or so
– Includes reading someone else’s code and then going off to write
your own.
Academic Honesty (cont.)
• Giving/receiving help on an exam is unacceptable
Text
1. D. Kirk and W. Hwu, "Programming Massively Parallel Processors – A Hands-on Approach," Morgan Kaufmann Publishers, 3rd edition, 2016
Biomolecular simulation: a computational
microscope for molecular biology
• Large clusters (scale out) allow simulation of biological systems of
realistic space dimensions
– Interesting biological systems have dimensions of mm or larger
– Thousands of nodes are required to hold and update all the grid points.
• Fast nodes (scale up) allow simulation at realistic time scales
– Simulation time steps at the femtosecond (10⁻¹⁵ second) level are needed for accuracy
– Biological processes take milliseconds or longer
– Current molecular dynamics simulations progress at about one day of computation for each 10–100 microseconds of the simulated process.
Blue Waters Breakthrough ─ Computational Biology
• Determination of the structure of the HIV capsid at the atomic level.
• Collaborative effort of experimental groups at the U. of Pittsburgh and Vanderbilt U., and Schulten's computational team at the U. of Illinois.
• 64-million-atom HIV capsid simulation of the process through which the capsid disassembles, releasing its genetic material ─ a critical step in understanding HIV infection and finding a target for antiviral drugs.
Blue Waters Computing System
Operational at Illinois since 3/2013: 49,504 CPUs ─ 4,224 GPUs
Frequency Scaled Too Fast 1993–2003
[chart: clock frequency on a log scale, 1985–2005]
Total Processor Power Increased (super-scaling of frequency and chip size)
[chart: total processor power on a log scale, 1985–2003]
CPUs: Latency-Oriented Design
[diagram: CPU chip with control logic, cache, a few powerful ALUs, and DRAM]
• High clock frequency
• Large caches
– Convert long-latency memory accesses to short-latency cache accesses
• Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
• Powerful ALUs
– Reduced operation latency
GPUs: Throughput-Oriented Design
[diagram: GPU chip with many simple cores and DRAM]
• Moderate clock frequency
• Small caches
– To boost memory throughput
• Simple control
– No branch prediction
– No data forwarding
• Energy-efficient ALUs
– Many; long latency but heavily pipelined for high throughput
• Requires a massive number of threads to tolerate latencies
Winning Strategies Use Both CPU and GPU
• CPUs for sequential parts where latency matters
– CPUs can be 10+× faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+× faster than CPUs for parallel code
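A minimal CUDA sketch of this division of labor (not from the slides; the kernel and sizes are illustrative): the CPU handles the sequential parts ─ allocation, initialization, checking ─ while the GPU handles the throughput-bound parallel loop.

```cuda
#include <cstdio>

// GPU: the parallel part. Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 16;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    // CPU: sequential setup, where latency matters.
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    // GPU: parallel part, where throughput wins.
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    // CPU: sequential result check.
    printf("c[10] = %f\n", c[10]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```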
Heterogeneous parallel computing is catching on.
• Financial Analysis
• Scientific Simulation
• Engineering Simulation
• Medical Imaging
• Data-Intensive Analytics
• Digital Audio Processing
• Digital Video Processing
• Computer Vision
• Biomedical Informatics
• Electronic Design Automation
© David Kirk/NVIDIA and Wen-mei W. Hwu
Load Balance
• The total amount of time to complete a parallel job is limited by the thread that takes the longest to finish
[figure: good vs. bad load balance across threads]
• Massively parallel execution cannot afford serialization
• Contention in accessing critical data causes serialization
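The contention point can be illustrated with a small CUDA sketch (hypothetical kernels, not from the slides): when every thread atomically updates a single counter, the hardware must serialize those updates; combining partial results per block in shared memory first reduces the contention on global memory.

```cuda
// High contention: all threads in the grid update ONE global counter,
// so the atomic updates are serialized at a single memory word.
__global__ void countAllInOne(const int *v, int n, int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] > 0)
        atomicAdd(count, 1);       // every matching thread contends here
}

// Lower contention: each block first accumulates in fast shared memory,
// then issues a single global atomic per block.
__global__ void countPerBlock(const int *v, int n, int *count) {
    __shared__ int local;          // one counter per block
    if (threadIdx.x == 0) local = 0;
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] > 0)
        atomicAdd(&local, 1);      // contention limited to one block
    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(count, local);   // one global atomic per block
}
```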
Summary ─ Massively Parallel Processing
• Applications must have parallelism
• Balanced load
• High bandwidth to memory
• Few conflicts (and little serialization)