
Advanced Computer Architecture

Fall 2019

Multithreaded Architectures
(Applied Parallel Programming)

Lecture 1: Introduction
1
Multicore processors

2
Multicore processors
• The benefit of having multiple cores is that the
system can handle multiple threads simultaneously
• Each core can handle a separate stream of data.
• This architecture greatly increases the performance of a
system that is running parallel applications.

3
SMT: Simultaneous Multithreading

CMP: Chip Multiprocessor
Multithreading cores
• A small number of threads
• Limited parallel processing capability
• Each thread runs a separate program

5
Massive Multithreading
• Very large number of threads (thousands)
• Massive parallelism
• SPMD model: Single Program Multiple Data
– Also called Data Parallelism
• Two kinds of threads:
– H/W threads - very fast switching (state kept in registers)
– S/W threads - slower switching (thousands of clock cycles)
6
Course Goals
• Learn to program massively parallel processors and achieve
– high performance

• Technical subjects
– principles of parallel algorithms
– processor architecture features and constraints
– programming techniques

7
Course Outline
1. Introduction
2. Data Parallel Computing
3. Scalable Parallel Execution
4. Memory and Data Locality
5. Performance Considerations
6. Numerical considerations
7. Parallel patterns: convolution
8. Parallel patterns: prefix sum
9. Parallel patterns: parallel histogram computation
10. Parallel patterns: sparse matrix computation
11. Parallel patterns: merge sort
12. Parallel patterns: graph search
13. Parallel processing with DSPs (and/or selected papers if time allows)
8
People
• Instructor:
Prof Shlomo Weiss
Office hours: by appointment
Office: rm 143 EE Maabadot building 1st floor
Mail: [email protected]

9
Web Resources
• Website: moodle
– Lecture slides (see notes for more details)
– Homework

– HW solution
– Forum for Q&A - your classmates often have answers

10
Grading
• Attendance mandatory!
• Exam: 85%
• Weekly homework: 15% (~1.7% per homework)
• Two weeks to upload your solution to moodle (check deadlines)
• Make sure to click on the submission button (otherwise it is a draft)
• Late homework will not be accepted!
• Submissions must be printed (handwritten homework not
acceptable)
• Homework done individually

11
Hands-on Homework
Three choices to run your programs on a CUDA device:
1. PCs available in the following classrooms:
– Binyan Tochna classrooms: 010, 001
– Binyan Kitot classroom: 003
– Visual Studio and the NVIDIA SDK are installed on these PCs.
2. If you have a CUDA device on your PC or laptop, you may do your
programming assignments at home. You’ll need Visual Studio.
3. https://medium.com/@iphoenix179/running-cuda-c-c-in-jupyter-or-how-to-run-nvcc-in-google-colab-663d33f53772

12
Getting started on CUDA
For your first homework assignment:
• Open https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html
• Follow the instructions in the cuda-quick-start-guide
• Select the “nbody” sample from those provided with the
CUDA installation.
• Run nbody.exe.
• Hand in a screenshot of “nbody” execution.
13
Academic Honesty
• You are allowed and encouraged to discuss assignments
with other students in the class. Getting verbal advice/help
from people who’ve already taken the course is also fine.
• Any reference to assignments from previous terms or web
postings is unacceptable
• Any copying of non-trivial code is unacceptable
– Non-trivial = more than a line or so
– Includes reading someone else’s code and then going off to write
your own.

14
Academic Honesty (cont.)
• Giving/receiving help on an exam is unacceptable

• Penalties for academic dishonesty:


– Zero on the assignment for the first occasion
– Automatic failure of the course for repeat offenses

15
Text
1. D. Kirk and W. Hwu, “Programming Massively
Parallel Processors – A Hands-on Approach,”
Morgan Kaufmann Publishers, 3rd edition, 2016

16
Biomolecular simulation: a computational
microscope for molecular biology
• Large clusters (scale out) allow simulation of biological systems of
realistic space dimensions
– Interesting biological systems have dimensions of mm or larger
– Thousands of nodes are required to hold and update all the grid points.
• Fast nodes (scale up) allow simulation at realistic time scales
– Simulation time steps at the femtosecond (10⁻¹⁵ second) level are
needed for accuracy
– Biological processes take milliseconds or longer
– Current molecular dynamics simulations progress at about one day of
compute for each 10-100 microseconds of the simulated process.
Blue Waters Breakthrough: Computational Biology
 Determination of the structure of the HIV capsid
at atomic-level
 Collaborative effort of experimental groups at the
U. of Pittsburgh and Vanderbilt U., and
Schulten’s computational team at the U. of
Illinois.
 64-million-atom HIV capsid simulation of the
process through which the capsid disassembles,
releasing its genetic material
 a critical step in understanding HIV infection and
finding a target for antiviral drugs.
Blue Waters Computing System
• Operational at Illinois since 3/2013
• 49,504 CPUs, 4,224 GPUs
• 12.5 PF
• 1.6 PB DRAM
• $250M
• IB switch: >1 TB/sec
• 10/40/100 Gb Ethernet switch
• 100 GB/sec; 120+ Gb/sec WAN
• Spectra Logic: 300 PBs; Sonexion: 26 PBs


Qualcomm SoC for Mobile

20
Frequency Scaled Too Fast 1993-2003

[Figure: clock frequency (MHz), log scale from 10 to 10000, plotted over the years 1985-2005]
Total Processor Power Increased
(super-scaling of frequency and chip size)

[Figure: total processor power, log scale from 1 to 100, plotted over the years 1985-2003]
CPUs: Latency Oriented Design
 High clock frequency
 Large caches
– Convert long latency memory accesses to short latency cache accesses
 Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
 Powerful ALU
– Reduced operation latency

[Figure: CPU block diagram - large control unit and cache, a few powerful ALUs, backed by DRAM]
GPUs: Throughput Oriented Design
 Moderate clock frequency
 Small caches
– To boost memory throughput
 Simple control
– No branch prediction
– No data forwarding
 Energy efficient ALUs
– Many, long latency but heavily pipelined for high throughput
 Require massive number of threads to tolerate latencies

[Figure: GPU block diagram - many small ALUs, minimal control and cache, backed by DRAM]
Winning Strategies Use Both CPU and GPU
• CPUs for sequential parts where latency matters
– CPUs can be 10+ times faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+ times faster than CPUs for parallel code

25
Heterogeneous parallel computing is catching on.
• Financial Analysis
• Scientific Simulation
• Engineering Simulation
• Data Intensive Analytics
• Medical Imaging
• Digital Audio Processing
• Digital Video Processing
• Computer Vision
• Electronic Design Automation
• Biomedical Informatics
• Statistical Modeling
• Ray Tracing Rendering
• Interactive Physics
• Numerical Methods

• 280 submissions to GPU Computing Gems
– 110 articles included in two volumes
26
Massive Parallelism - Regularity

© David Kirk/NVIDIA and Wen-mei W. Hwu
27
Load Balance
• The total amount of time to complete a parallel job is limited by the thread that
takes the longest to finish

good bad!

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016 28


ECE408/CS483, University of Illinois, Urbana-Champaign
Global Memory Bandwidth

[Figure: ideal vs. actual memory access patterns]

29
Conflicting Data Accesses Cause Serialization and Delays

• Massively parallel
execution cannot
afford serialization

• Contention in accessing
critical data causes
serialization

30
Summary: Massively Parallel Processing
• Applications must have parallelism
• Balanced load
• High bandwidth to memory
• Few conflicts (and serialization)

31
