Graphics Processing Unit (GPU) Architecture and Programming TU/e 5kk73 Zhenyu Ye Bart Mesman Henk Corporaal 2010-11-08
Today's Topics GPU architecture GPU programming GPU micro-architecture Performance optimization and model Trends
Today's Topics GPU architecture GPU programming GPU micro-architecture Performance optimization and model Trends
System Architecture
GPU Architecture NVIDIA Fermi, 512 Processing Elements (PEs)
What Can It Do? Render triangles. NVIDIA GTX480 can render 1.6 billion triangles per second!
General-Purpose Computing ref: http://www.nvidia.com/object/tesla_computing_solutions.html
The Vision of NVIDIA "Within the next few years, there will be single-chip graphics devices more  powerful  and  versatile  than any graphics system that has ever been built, at any price."  -- David Kirk, NVIDIA,  1998
Single-Chip GPU vs. Fastest Super Computers ref: http://www.llnl.gov/str/JanFeb05/Seager.html
Top500 Super Computer in June 2010
GPU Will Top the List in Nov 2010
The Gap Between CPU and GPU ref:  Tesla GPU Computing Brochure
GPU Has 10x Comp Density Given the same chip area, the achievable performance of a GPU is 10x higher than that of a CPU.
Evolution of Intel Pentium Pentium I Pentium II Pentium III Pentium IV Chip area breakdown Q: What can you observe? Why?
Extrapolation of Single Core CPU If we extrapolate the trend, in a few generations, Pentium will look like: Of course, we know it did not happen.  Q: What happened instead? Why?
Evolution of Multi-core CPUs Penryn Bloomfield Gulftown Beckton Chip area breakdown Q: What can you observe? Why?
Let's Take a Closer Look Less than  10%  of total chip area is used for the real execution. Q: Why?
The Memory Hierarchy Notes on Energy at 45nm:  64-bit Int ADD takes about 1 pJ. 64-bit FP FMA takes about 200 pJ. It seems we can not further increase the computational density.
The Brick Wall -- UC Berkeley's View Power Wall : power expensive, transistors free Memory Wall : Memory slow, multiplies fast ILP Wall : diminishing returns on more ILP HW David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007,  link
The Brick Wall -- UC Berkeley's View Power Wall : power expensive, transistors free Memory Wall : Memory slow, multiplies fast ILP Wall : diminishing returns on more ILP HW Power Wall  +  Memory Wall  +  ILP Wall  =  Brick Wall David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007,  link
How to Break the Brick Wall? Hint: how to exploit the parallelism inside the application?
Step 1: Trade Latency with Throughput Hide the memory latency through fine-grained interleaved threading.
Interleaved Multi-threading
Interleaved Multi-threading The  granularity  of interleaved multi-threading: 100 cycles : hide off-chip memory latency 10 cycles : + hide cache latency 1 cycle : + hide branch latency, instruction dependency
Interleaved Multi-threading The granularity of interleaved multi-threading: 100 cycles: hide off-chip memory latency 10 cycles: + hide cache latency 1 cycle: + hide branch latency, instruction dependency Fine-grained interleaved multi-threading: Pros : ? Cons : ?
Interleaved Multi-threading The granularity of interleaved multi-threading: 100 cycles: hide off-chip memory latency 10 cycles: + hide cache latency 1 cycle: + hide branch latency, instruction dependency Fine-grained interleaved multi-threading: Pros : remove branch predictor, OOO scheduler, large cache Cons : register pressure, etc.
Fine-Grained Interleaved Threading Pros:  reduce cache size, no branch predictor,  no OOO scheduler Cons:  register pressure, thread scheduler, require huge parallelism Without and with fine-grained interleaved threading
HW Support Register file supports  zero overhead  context switch between interleaved threads.
Can We Make Further Improvement? Reducing large cache gives 2x computational density. Q: Can we make further improvements? Hint: We have only utilized thread level parallelism (TLP) so far.
Step 2: Single Instruction Multiple Data SSE has 4 data lanes GPU has 8/16/24/... data lanes GPU uses wide SIMD: 8/16/24/... processing elements (PEs) CPU uses short SIMD: usually has vector width of 4.
Hardware Support Supporting interleaved threading + SIMD execution
Single Instruction Multiple Thread (SIMT) Hide vector width using scalar threads.
Example of SIMT Execution Assume 32 threads are grouped into one warp.
Step 3: Simple Core The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core. Lightweight PE: Fused Multiply Add (FMA) SFU: Special Function Unit
NVIDIA's Motivation of Simple Core "This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." --Bill Dally, NVIDIA
Review: How Do We Reach Here? NVIDIA Fermi, 512 Processing Elements (PEs)
Throughput Oriented Architectures Fine-grained interleaved threading (~2x comp density) SIMD/SIMT (>10x comp density) Simple core (~2x comp density) Key architectural features of throughput-oriented processors. ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010. ( link )
Today's Topics GPU architecture GPU programming GPU micro-architecture Performance optimization and model Trends
CUDA Programming Massive number (>10000) of  light-weight  threads.
Express Data Parallelism in Threads  Compare thread program with vector program.
Vector Program Scalar program   float A[4][8]; do-all(i=0;i<4;i++){      do-all(j=0;j<8;j++){          A[i][j]++;       } }  Vector program (vector width of 8) float A[4][8]; do-all(i=0;i<4;i++){      movups xmm0, [ &A[i][0] ]      incps xmm0      movups [ &A[i][0] ], xmm0 }   Vector width is exposed to programmers.
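For comparison, here is a minimal SSE sketch of the same computation (illustrative only, not taken from the slides; as the editor's note for this slide points out, incps is not a real instruction, so a real implementation would add a vector of ones with _mm_add_ps). The 4-wide vector width is explicit in the code:

    // Increment every element of A[4][8] with 4-wide SSE vectors:
    // each row of 8 floats takes two vector operations.
    #include <xmmintrin.h>

    void inc_sse(float A[4][8]) {
        const __m128 ones = _mm_set1_ps(1.0f);
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 8; j += 4) {        // vector width (4) is exposed here
                __m128 v = _mm_loadu_ps(&A[i][j]);  // load 4 floats
                v = _mm_add_ps(v, ones);            // add 1.0 to each lane
                _mm_storeu_ps(&A[i][j], v);         // store 4 floats
            }
        }
    }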
CUDA Program Scalar program float A[4][8]; do-all(i=0;i<4;i++){ do-all(j=0;j<8;j++){ A[i][j]++; } } CUDA program float A[4][8]; kernelF<<<(4,1),(8,1)>>>(A); __global__ kernelF(A){ i = blockIdx.x; j = threadIdx.x; A[i][j]++; } A CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP). Hardware converts TLP into DLP at run time.
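Below is a compilable sketch of this example with the host side filled in (the flat indexing and names such as dA are illustrative additions; the kernel is declared __global__ because it is launched from the host):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void kernelF(float *A) {    // one thread per array element
        int i = blockIdx.x;                // block index selects the row
        int j = threadIdx.x;               // thread index selects the column
        A[i * 8 + j] += 1.0f;              // A is a flat 4x8 array in global memory
    }

    int main() {
        float hA[4][8] = {};                                    // host copy, all zeros
        float *dA;
        cudaMalloc(&dA, sizeof(hA));                            // device allocation
        cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
        kernelF<<<dim3(4, 1), dim3(8, 1)>>>(dA);                // 4 blocks of 8 threads
        cudaMemcpy(hA, dA, sizeof(hA), cudaMemcpyDeviceToHost);
        cudaFree(dA);
        printf("A[0][0] = %f\n", hA[0][0]);                     // expect 1.000000
        return 0;
    }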
Two Levels of Thread Hierarchy kernelF<<<(4,1),(8,1)>>>(A); __global__ kernelF(A){ i = blockIdx.x; j = threadIdx.x; A[i][j]++; }
Multi-dimension Thread and Block ID kernelF<<<(2,2),(4,2)>>>(A); __global__ kernelF(A){ i = gridDim.x * blockIdx.y + blockIdx.x; j = blockDim.x * threadIdx.y + threadIdx.x; A[i][j]++; } Both the grid and the thread block can have a two-dimensional index.
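The same 2-D launch can be written with CUDA's built-in dimension variables as in the sketch below (flat indexing is an assumption for the example); the comments work out the indices for one concrete thread:

    __global__ void kernelF2D(float *A) {
        int i = gridDim.x  * blockIdx.y  + blockIdx.x;   // e.g. blockIdx  = (1,1) -> i = 2*1 + 1 = 3
        int j = blockDim.x * threadIdx.y + threadIdx.x;  // e.g. threadIdx = (2,1) -> j = 4*1 + 2 = 6
        A[i * 8 + j] += 1.0f;
    }
    // launched as: kernelF2D<<<dim3(2, 2), dim3(4, 2)>>>(dA);  // (2,2) grid of (4,2) blocks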
Scheduling Thread Blocks on SM Example: Scheduling 4 thread blocks on 3 SMs.
Executing Thread Block on SM kernelF<<<(2,2),(4,2)>>>(A); __global__ kernelF(A){ i = gridDim.x * blockIdx.y + blockIdx.x; j = blockDim.x * threadIdx.y + threadIdx.x; A[i][j]++; } Executed on a machine with width of 4: Executed on a machine with width of 8: Notes: the number of Processing Elements (PEs) is transparent to the programmer.
Multiple Levels of Memory Hierarchy
Name       Cached?   Latency (cycles)        Access
Global     L1/L2     200~400 (cache miss)    R/W
Shared     No        1~3                     R/W
Constant   Yes       1~3                     Read-only
Texture    Yes       ~100                    Read-only
Local      L1/L2     200~400 (cache miss)    R/W
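As a rough illustration (not from the slides), the snippet below shows how these spaces are declared in CUDA C; texture memory is left out because it needs extra binding/API code:

    __constant__ float coeff[16];              // constant memory: cached, read-only in kernels

    __global__ void spaces(float *gmem) {      // gmem points into global memory (off-chip DRAM)
        __shared__ float tile[256];            // shared memory: on-chip, shared per thread block
        float v = gmem[threadIdx.x];           // v lives in a register (or local memory if spilled)
        tile[threadIdx.x] = v * coeff[0];
        __syncthreads();
        gmem[threadIdx.x] = tile[threadIdx.x];
    }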
Explicit Management of Shared Mem Shared memory is frequently used to exploit locality.
Shared Memory and Synchronization kernelF<<<(1,1),(16,16)>>>(A); __global__ kernelF(A){ __shared__ float smem[16][16]; //allocate smem i = threadIdx.y; j = threadIdx.x; smem[i][j] = A[i][j]; __syncthreads(); A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9; } Example: average filter with 3x3 window 3x3 window on image Image data in DRAM
Shared Memory and Synchronization kernelF<<<(1,1),(16,16)>>>(A); __global__ kernelF(A){ __shared__ float smem[16][16]; i = threadIdx.y; j = threadIdx.x; smem[i][j] = A[i][j]; // load to smem __syncthreads(); // threads wait at barrier A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9; } Example: average filter over 3x3 window 3x3 window on image Stage data in shared mem
Shared Memory and Synchronization kernelF<<<(1,1),(16,16)>>>(A); __global__ kernelF(A){ __shared__ float smem[16][16]; i = threadIdx.y; j = threadIdx.x; smem[i][j] = A[i][j]; __syncthreads(); // every thread is ready A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9; } Example: average filter over 3x3 window 3x3 window on image all threads finish the load
Shared Memory and Synchronization kernelF<<<(1,1),(16,16)>>>(A); __global__ kernelF(A){ __shared__ float smem[16][16]; i = threadIdx.y; j = threadIdx.x; smem[i][j] = A[i][j]; __syncthreads(); A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9; } Example: average filter over 3x3 window 3x3 window on image Start computation
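A self-contained sketch of this averaging kernel is shown below. It adds the border clamping that the slide omits (edge threads would otherwise index outside the 16x16 tile); the flat layout of A is an assumption for the example:

    __global__ void avgFilter(float *A) {          // A: a 16x16 image stored row-major
        __shared__ float smem[16][16];
        int i = threadIdx.y;
        int j = threadIdx.x;
        smem[i][j] = A[i * 16 + j];                // stage the tile in shared memory
        __syncthreads();                           // wait until the whole tile is loaded
        float sum = 0.0f;
        for (int di = -1; di <= 1; di++)
            for (int dj = -1; dj <= 1; dj++) {
                int ii = min(max(i + di, 0), 15);  // clamp at the tile border
                int jj = min(max(j + dj, 0), 15);
                sum += smem[ii][jj];
            }
        A[i * 16 + j] = sum / 9.0f;
    }
    // launched as: avgFilter<<<dim3(1, 1), dim3(16, 16)>>>(dA);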
Programmers Think in Threads Q: Why make this hassle?
Why Use Thread instead of Vector? Thread Pros: Portability. Machine width is transparent in the ISA. Productivity. Programmers do not need to take care of the vector width of the machine. Thread Cons: Manual sync. Give up lock-step execution within the vector. Scheduling of threads could be inefficient. Debug. "Threads considered harmful": thread programs are notoriously hard to debug.
Features of CUDA Programmers explicitly express DLP in terms of TLP. Programmers explicitly manage memory hierarchy. etc.
Today's Topics GPU architecture GPU programming GPU micro-architecture Performance optimization and model Trends
Micro-architecture GF100 micro-architecture
HW Groups Threads Into Warps Example: 32 threads per warp
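With a 1-D thread block, a thread's warp and lane follow directly from its index, as in the illustrative kernel below (the output arrays are hypothetical; warp size is 32 on NVIDIA GPUs):

    __global__ void whoAmI(int *warp_of, int *lane_of) {
        int tid = threadIdx.x;
        warp_of[tid] = tid / 32;   // threads 0..31 form warp 0, threads 32..63 form warp 1, ...
        lane_of[tid] = tid % 32;   // position of the thread within its warp
    }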
Example of Implementation Note: NVIDIA may use a more complicated implementation.
Example Program Address :  Inst 0x0004 :  add r0, r1, r2 0x0008 :  sub r3, r4, r5 Assume  warp 0  and  warp 1  are scheduled for execution.
Read Src Op Program Address: Inst 0x0004: add r0,  r1 , r2 0x0008: sub r3,  r4 , r5 Read source operands: r1  for warp 0 r4  for warp 1
Buffer Src Op Program Address: Inst 0x0004: add r0,  r1 , r2 0x0008: sub r3,  r4 , r5 Push ops to op collector: r1  for warp 0 r4  for warp 1
Read Src Op Program Address: Inst 0x0004: add r0, r1,  r2 0x0008: sub r3, r4,  r5 Read source operands: r2  for warp 0 r5  for warp 1
Buffer Src Op Program Address: Inst 0x0004: add r0, r1,  r2 0x0008: sub r3, r4,  r5 Push ops to op collector: r2  for warp 0 r5  for warp 1
Execute Program Address: Inst 0x0004:  add  r0, r1, r2 0x0008:  sub  r3, r4, r5 Compute the  first 16 threads  in the warp.
Execute Program Address: Inst 0x0004:  add  r0, r1, r2 0x0008:  sub  r3, r4, r5 Compute the  last 16 threads  in the warp.
Write back Program Address: Inst 0x0004: add  r0 , r1, r2 0x0008: sub  r3 , r4, r5 Write back: r0  for warp 0 r3  for warp 1
Other High Performance GPU ATI Radeon 5000 series.
ATI Radeon 5000 Series Architecture
Radeon SIMD Engine 16 Stream Cores (SC) Local Data Share
VLIW Stream Core (SC)
Local Data Share (LDS)
Today's Topics GPU architecture GPU programming GPU micro-architecture Performance optimization and model Trends
Performance Optimization Optimizations on memory latency tolerance Reduce register pressure Reduce shared memory pressure Optimizations on memory bandwidth Global memory coalescing Avoid shared memory bank conflicts Grouping byte accesses Avoid partition camping Optimizations on computation efficiency Mul/Add balancing Increase floating point proportion Optimizations on operational intensity Use tiled algorithms Tuning thread granularity
Performance Optimization Optimizations on memory latency tolerance Reduce register pressure Reduce shared memory pressure Optimizations on memory bandwidth Global memory coalescing Avoid shared memory bank conflicts Grouping byte accesses Avoid partition camping Optimizations on computation efficiency Mul/Add balancing Increase floating point proportion Optimizations on operational intensity Use tiled algorithms Tuning thread granularity
Shared Mem Contains Multiple Banks
Compute Capability Need arch info to perform optimization. ref: NVIDIA, "CUDA C Programming Guide", ( link )
Shared Memory (compute capability 2.x) without bank conflict: with bank conflict:
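A common way to avoid such conflicts is to pad a shared array by one column, as in the sketch below (assumptions: 32 banks of 4-byte words, as on compute capability 2.x, and a 32x32 tile transpose as the example):

    __global__ void transposeTile(float *in, float *out) {   // one 32x32 thread block per tile
        __shared__ float tile[32][32 + 1];   // +1 column of padding: a column of the tile
                                             // now spans 32 different banks
        int x = threadIdx.x, y = threadIdx.y;
        tile[y][x] = in[y * 32 + x];         // coalesced load, row-wise store into smem
        __syncthreads();
        out[y * 32 + x] = tile[x][y];        // coalesced store; the column-wise read of smem
    }                                        // would be a 32-way conflict without the padding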
Performance Optimization Optimizations on memory latency tolerance Reduce register pressure Reduce shared memory pressure     Optimizations on memory bandwidth  Global memory alignment and coalescing Avoid shared memory bank conflicts Grouping byte access  Avoid Partition camping   Optimizations on computation efficiency  Mul/Add balancing Increase floating point proportion    Optimizations on operational intensity  Use tiled algorithm Tuning thread granularity
Global Memory In Off-Chip DRAM Address space is interleaved among multiple channels.
Global Memory
Global Memory
Global Memory
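The two illustrative kernels below contrast a coalesced access pattern with a strided one (names and sizes are assumptions). On Fermi-class GPUs the first lets a warp's 32 accesses fall into one or two 128-byte transactions, while a large stride forces a separate transaction per thread:

    __global__ void coalesced(float *a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] += 1.0f;            // neighboring threads touch neighboring words
    }

    __global__ void strided(float *a, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        a[i] += 1.0f;            // each thread hits its own cache line / DRAM burst,
    }                            // so effective bandwidth collapses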
Roofline Model Identify the performance bottleneck: computation bound vs. bandwidth bound
Optimization Is Key for Attainable Gflops/s
Computation, Bandwidth, Latency Illustrating three bottlenecks in the Roofline model.
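As a worked example of the roofline bound, the small host-side function below computes attainable throughput as min(peak compute, operational intensity x peak bandwidth); the peak numbers are rough GTX480-class figures assumed for illustration, not measurements:

    #include <algorithm>

    double attainable_gflops(double flops_per_byte) {
        const double peak_gflops = 1345.0;   // assumed single-precision peak
        const double peak_gbs    = 177.0;    // assumed DRAM bandwidth
        return std::min(peak_gflops, flops_per_byte * peak_gbs);
    }
    // e.g. 1 flop per 4-byte load (0.25 flop/byte) is bandwidth bound
    // at about 0.25 * 177 = 44 Gflop/s, far below the compute peak.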
Today's Topics GPU architecture GPU programming GPU micro-architecture Performance optimization and model Trends
Trends Coming architectures: Intel's Larrabee successor: Many Integrated Core (MIC) CPU/GPU fusion: Intel Sandy Bridge, AMD Llano.
Intel Many Integrated Core (MIC) 32 core version of MIC:
Intel Sandy Bridge Highlight: Reconfigurable shared L3 for CPU and GPU Ring bus
Sandy Bridge's New CPU-GPU interface ref: "Intel's Sandy Bridge Architecture Exposed", from Anandtech, ( link )
Sandy Bridge's New CPU-GPU interface ref: "Intel's Sandy Bridge Architecture Exposed", from Anandtech, ( link )
AMD Llano Fusion APU (expected Q3 2011) Notes: CPU and GPU are not sharing cache? Unknown interface between CPU/GPU
GPU Research in ES Group GPU research in the Electronic Systems group. http://www.es.ele.tue.nl/~gpuattue/

Editor's Notes

  • #6: NVIDIA planned to put 512 PEs into a single GPU, but the GTX480 turns out to have 480 PEs.
  • #8: GPU can achieve 10x performance over CPU. 
  • #11: Notice the third place is PowerXCell. Rmax is the performance of Linpack benchmark. Rpeak is the raw performance of the machine.
  • #13: This gap is narrowed by multi-core CPUs.
  • #14: Comparing raw performance is less interesting.
  • #15: The area breakdown is an approximation, but it is good enough to see the trend.
  • #17: The size of the L3 in high-end and low-end CPUs is quite different.
  • #18: This break down is also an approximation.
  • #19: Numbers are based on Intel Nehalem at 45nm and the presentation of Bill Dally.
  • #23: More registers are required to store the contexts of threads.
  • #24: Hiding memory latency by multi-threading. The Cell uses a relatively static approach. The overlapping of computation and DMA transfer is explicitly specified by programmer.
  • #25: Fine-grained multi-threading can keep the PEs busy even the program has little ILP.
  • #27: The cache can still help.
  • #29: The address assignment and translation is done dynamically by hardware.
  • #30: The vector core should be larger than scalar core.
  • #32: From scalar to vector.
  • #33: From vector to threads.
  • #34: Warps can be grouped at run time by hardware. In this case it will be transparent to the programmer.
  • #35: The NVIDIA Fermi PE can do int and fp.
  • #37: We have ignored some architectural features of Fermi.  Notably, the interconnection network is not discussed here. 
  • #38: These features are summarized by the paper of Michael Garland and David Kirk.
  • #42: The vector program uses SSE as an example. However, "incps" is not a real SSE instruction; it is used here to represent incrementing the vector.
  • #43: Each thread uses its ID to locate its working data set.
  • #46: The scheduler tries to maintain load balancing among SMs.
  • #48: Numbers taken from an old paper on G80 architecture, but it should be similar to the GF100 architecture.
  • #49: The old architecture has 16 banks.
  • #54: It is a trend to use threads to hide the vector width. OpenCL applies the same programming model.
  • #55: It is arguable whether working on threads is more productive.
  • #60: This example assumes the two warp schedulers are decoupled. It is possible that they are coupled together, at the cost of hardware complexity.
  • #62: Assume the register file has one read port. The register file may need two read ports to support instructions with 3 source operands, e.g. the Fused Multiply Add (FMA).
  • #72: 5 issue VLIW.
  • #73: The atomic unit is helpful in voting operation, e.g. histogram. 
  • #85: The figure is taken from 8800 GPU. See the paper of Samuel Williams for more detail.
  • #86: The number is obtained in 8800 GPU.
  • #87: The latency hiding is addressed in the PhD thesis of Samuel Williams.