GPU
June 6th, 2019
7nm
251 mm²
10.3 Billion Transistors
x16 PCIe® Gen 4.0
8 GB GDDR6, 256-bit @ 14 Gbps
448 GB/s*
2560 Stream Processors
Up To 9.75 TFLOPs
* 256 pins GDDR6 × 14 Gbps × 1 B / 8 b = 448 GB/s
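The footnote arithmetic on this slide can be checked with a short sketch. The bandwidth formula is the one the slide states (pins × per-pin rate × 1 byte / 8 bits). The second function is an assumption that is not on the slide: it supposes each stream processor retires one fused multiply-add (2 FLOPs) per clock, and backs out the clock that "Up To 9.75 TFLOPs" would then imply. The function names are illustrative only.

```python
def memory_bandwidth_gb_per_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth: pins x per-pin data rate, converted from bits to bytes."""
    return bus_width_bits * data_rate_gbps / 8


def implied_clock_ghz(tflops: float, stream_processors: int,
                      flops_per_sp_per_clock: int = 2) -> float:
    """Clock implied by a peak TFLOPs figure.

    Assumption (not stated on the slide): one FMA, counted as 2 FLOPs,
    per stream processor per clock.
    """
    flops_per_clock = stream_processors * flops_per_sp_per_clock
    return tflops * 1e12 / flops_per_clock / 1e9


if __name__ == "__main__":
    print(memory_bandwidth_gb_per_s(256, 14))       # 448.0 GB/s, matching the slide
    print(round(implied_clock_ghz(9.75, 2560), 3))  # about 1.904 GHz under the FMA assumption
```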
Radeon Display Engine
New High Resolution HDR Displays
New Levels of Compression
Radeon Multi-Media Engine
Seamless Streaming
Improved Encoding
New Graphics RDNA Architecture
New Compute Units
Multilevel Cache
Streamlined Engine
[Block diagram: Infinity Fabric; PCIe Gen 4; Display Engine; Multimedia Engine; Geometry Processor; Graphics Command Processor; 4 ACEs; HWS; DMA; 2 Shader Engines; 20 Dual Compute Units; 4 L1 caches; 4 Prim Units; 4 Rasterizers; 16 RBs; 16 L2 slices; four 64-bit Memory Controllers]
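The unit counts in the block diagram can be cross-checked against the headline specifications. The sketch below is only a consistency check under stated assumptions: that each Dual Compute Unit holds two compute units, that each compute unit holds two SIMD32 units (the "Dual SIMD32" of the compute unit slide), and that one SIMD lane corresponds to one stream processor. The constant and function names are illustrative.

```python
# Counts read off the block diagram labels.
DUAL_COMPUTE_UNITS = 20
MEMORY_CONTROLLERS = 4
CONTROLLER_WIDTH_BITS = 64

# Assumptions, not read off the diagram itself.
CUS_PER_DUAL_CU = 2    # a Dual Compute Unit pairs two compute units
SIMD32_PER_CU = 2      # "Dual SIMD32" per compute unit
LANES_PER_SIMD32 = 32  # one lane counted as one stream processor


def stream_processors() -> int:
    """Total SIMD lanes across the whole shader array."""
    return DUAL_COMPUTE_UNITS * CUS_PER_DUAL_CU * SIMD32_PER_CU * LANES_PER_SIMD32


def memory_bus_width_bits() -> int:
    """Aggregate memory interface width."""
    return MEMORY_CONTROLLERS * CONTROLLER_WIDTH_BITS


if __name__ == "__main__":
    print(stream_processors())      # 2560, matching "2560 Stream Processors"
    print(memory_bus_width_bits())  # 256, matching the 256-bit GDDR6 interface
```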
High Fidelity Internal Color Depth
30 bpp Color
Optimized for High Resolution HDR Displays
4K 240 Hz | Single Cable | 8K 60 Hz
Optimized for Head Mounted Displays
Single IO Connectivity
Better Design For Power Efficiency
Multi-Plane Overlay Protocol With Low Voltage Mode
HDMI® 2.0 & DisplayPort 1.4 HDR
Display Stream Compression 1.2a
Direct Read of DCC Compressed Surfaces
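A rough calculation suggests why the slide pairs the 4K 240 Hz and 8K 60 Hz single-cable modes with Display Stream Compression. The sketch below counts active pixels only (blanking intervals are ignored, which is a simplification) at the 30 bpp colour depth the slide names, and compares the result with the DisplayPort 1.4 HBR3 payload rate of four lanes at 8.1 Gbps with 8b/10b coding. The function and constant names are illustrative.

```python
def uncompressed_gbps(width: int, height: int, refresh_hz: int, bits_per_pixel: int) -> float:
    """Active-pixel data rate in Gbit/s (blanking ignored as a simplification)."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


# DisplayPort 1.4 HBR3: 4 lanes x 8.1 Gbps raw, 8b/10b coding leaves 80% as payload.
DP14_PAYLOAD_GBPS = 4 * 8.1 * 8 / 10

if __name__ == "__main__":
    uhd_240 = uncompressed_gbps(3840, 2160, 240, 30)
    uhd8k_60 = uncompressed_gbps(7680, 4320, 60, 30)
    # Both modes carry the same pixel rate, so they need the same link budget.
    print(round(uhd_240, 2), round(uhd8k_60, 2))
    # Minimum compression ratio needed to fit the single-cable payload.
    print(round(uhd_240 / DP14_PAYLOAD_GBPS, 2))
```

Under these assumptions both modes need roughly 2.3:1 compression to fit one DisplayPort 1.4 link.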
H.264
MPEG-4
1080p600
4K150
1080p360
4K90
1080p360
4K60
1080p360
4K90
8K24
NEXT
GEN
4K90
8K24
IMPROVED ENCODING
New HDR/WCG Encode (HEVC)
8K Decode (HEVC & VP9)
40% Encoder Speedups
YouTube
twitch
8b/10b
8b
8b/10b
Motivation
Radeon Architecture
New Compute Unit
Multi-Level Cache
Streamlined Engine
Results & Example
GPU Architecture Designed For
Gaming Performance & Efficiency
THE MOTIVATION BEHIND NAVI ARCHITECTURE
“NAVI”
2X INSTRUCTION RATE
Dual Schedulers
Dual Scalar Units
Dual SIMD32
SINGLE CYCLE ISSUE
Wave32 on SIMD32
ALU & LD/ST Unit
SFU Co-Execution
BYTES PER FLOP
128B Load/Store
64B Filter Rate
EXECUTION FLEXIBILITY
Wave64 Dual Issue
Cooperating CU Pair
[Compute unit diagram labels: LDS RTN, IDX DIRECT, VECTOR, MEM RTN, V INIT, DATA]
GCN Instruction Execution
Wave64 on SIMD16 (4clk)
[Timeline: Wave0 – Instruction N, Wave0 – Instruction N+1, Wave2 – Instruction M; each instruction issues as four thread groups: T 0-15, T 16-31, T 32-47, T 48-63]
“GCN”: Fixed Interleave Of 4 Sets Of Threads Preventing Fine Grain Dynamic Compiler Scheduling
Interleave low priority waves on long latency stall

RDNA Instruction Execution
Wave32 on SIMD32 (1clk)
[Timeline along the Time axis: Wave0 Instruction N, Wave0 Instruction N+1, Wave2 Instruction N+1; each instruction issues threads T 0-31 in one clock]
“RDNA”: Single Cycle Instruction Issue Enabling Fine Grain Compiler Driven Scheduling To Optimize For Prioritized Single Threaded Performance
Interleave lower priority waves on long latency stall

Both Designs Utilize Multithreading of different waves to achieve throughput and engine utilization
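The thread groups in the two timelines follow directly from dividing the wave size by the SIMD width. A minimal sketch, with an illustrative helper name, that reproduces the "T 0-15 ... T 48-63" slices for GCN and the single "T 0-31" slice for RDNA:

```python
def issue_slices(wave_size: int, simd_width: int) -> list[tuple[int, int]]:
    """Thread ranges (inclusive) issued on each clock for one wave instruction."""
    return [
        (start, min(start + simd_width, wave_size) - 1)
        for start in range(0, wave_size, simd_width)
    ]


if __name__ == "__main__":
    # GCN: Wave64 on SIMD16 takes four clocks per instruction.
    print(issue_slices(64, 16))  # [(0, 15), (16, 31), (32, 47), (48, 63)]
    # RDNA: Wave32 on SIMD32 takes one clock per instruction.
    print(issue_slices(32, 32))  # [(0, 31)]
```

The length of the returned list is the number of issue clocks per instruction, which is the 4clk versus 1clk difference the slide highlights.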
WORK LOAD EXAMPLE: 64 WORK-ITEMS ALU INTENSIVE CODE

GCN
1 Wave64 ➔ SIMD16
Instruction Issue ➔ 4 clock
CU ALU ➔ 25% Utilized
[Diagram: four SIMD16 units, SIMD 0 to SIMD 3, lanes 0-15 each]

RDNA
2 Wave32 ➔ 2 SIMD32
Instruction Issue ➔ 1 clock
CU ALU ➔ 100% Utilized
[Diagram: two SIMD32 units, SIMD 0 and SIMD 1, lanes 0-31 each]

Effective Throughput
ILP unlocks up to 4x faster focused execution
Engage Machine Quickly By Uniformly Distributing Work To All ALUs
Optimize Efficiency And Latency By Preferring Highest Priority/Oldest Work
Extract Program ILP And Scheduling To Benefit From Data Locality
Utilize Multi-Threading Of Waves To Hide Remaining Latencies For Throughput
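The 25% and 100% utilization figures in this workload example can be reproduced with a small model. It is a sketch under simplifying assumptions: each wave is placed on its own SIMD, a SIMD that holds a wave counts as fully busy, and no other waves are resident to fill the idle SIMDs. The function name is illustrative.

```python
from math import ceil


def cu_alu_utilization(work_items: int, wave_size: int, simds_per_cu: int) -> float:
    """Fraction of a compute unit's SIMDs that receive a wave.

    Assumes one wave per SIMD and no other resident waves.
    """
    waves = ceil(work_items / wave_size)
    busy_simds = min(waves, simds_per_cu)
    return busy_simds / simds_per_cu


if __name__ == "__main__":
    gcn = cu_alu_utilization(64, wave_size=64, simds_per_cu=4)   # 1 Wave64 on 4 x SIMD16
    rdna = cu_alu_utilization(64, wave_size=32, simds_per_cu=2)  # 2 Wave32 on 2 x SIMD32
    print(gcn, rdna, rdna / gcn)  # 0.25 1.0 4.0
```

The ratio of the two results is the "up to 4x" figure quoted for focused execution of a single 64 work-item workload.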
Issue timeline for a four-instruction sequence, one line per cycle.

Instruction sequence:
    s_add_i32 s0, s1, s2
    v_mul_f32 v0, v1, s0
    v_add_f32 v5, v4, v3
    v_sub_f32 v6, v7, v0

GCN Wave64 on SIMD16 (cycles 0-15):
     0  s_add_i32 s0, s1, s2
     1  …
     2  …
     3  …
     4  v_mul_f32 v0, v1, s0
     5  … (simd busy 4 cycles)
     6  …
     7  …
     8  v_add_f32 v5, v4, v3
     9  …
    10  …
    11  …
    12  v_sub_f32 v6, v7, v0
    13  …
    14  …
    15  …

SIMD32 Wave32 (cycles 0-7):
     0  s_add_i32 s0, s1, s2
     1  … (salu dependency stall on S0)
     2  v_mul_f32 v0, v1, s0
     3  v_add_f32 v5, v4, v3
     4  … (valu dependency stall on V0)
     5  …
     6  …
     7  v_sub_f32 v6, v7, v0

SIMD32 Wave64 (cycles 0-8):
     0  s_add_i32 s0, s1, s2
     1  … (salu dependency stall on S0)
     2  v_mul_f32 v0, v1, s0 (lo)
     3  v_mul_f32 v0, v1, s0 (hi)
     4  v_add_f32 v5, v4, v3 (lo)
     5  v_add_f32 v5, v4, v3 (hi)
     6  … (valu dependency stall on V0 lo)
     7  v_sub_f32 v6, v7, v0 (lo)
     8  v_sub_f32 v6, v7, v0 (hi)

SHORTEST WAVE ISSUE LATENCY
44% REDUCTION IN ISSUE CYCLES
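The cycle counts in this listing can be reproduced with a toy in-order issue model. This is a sketch, not AMD's scheduler: the two latencies are assumptions inferred from the stall pattern in the listing (a scalar result usable 2 cycles after issue, a vector result usable 5 cycles after issue), and the GCN case simply charges 4 cycles per instruction. All names are illustrative.

```python
# Assumed latencies, inferred from the stalls shown in the listing.
SALU_LATENCY = 2  # cycles from issue until a scalar result can be consumed
VALU_LATENCY = 5  # cycles from issue until a vector result can be consumed

# (opcode, destination register, source registers)
PROGRAM = [
    ("s_add_i32", "s0", ("s1", "s2")),
    ("v_mul_f32", "v0", ("v1", "s0")),
    ("v_add_f32", "v5", ("v4", "v3")),
    ("v_sub_f32", "v6", ("v7", "v0")),
]


def gcn_issue_cycles(program, cycles_per_instruction=4):
    """GCN Wave64 on SIMD16: every instruction occupies the issue slot for 4 cycles."""
    return len(program) * cycles_per_instruction


def rdna_issue_cycles(program, halves=1):
    """In-order single-issue model; halves=2 issues each vector op as lo then hi."""
    ready = {}  # (register, half) -> first cycle the value can be read
    cycle = 0
    for opcode, dst, srcs in program:
        is_vector = opcode.startswith("v_")
        for half in range(halves if is_vector else 1):
            issue = cycle
            for src in srcs:
                # Vector registers are tracked per half; scalar registers are shared.
                key = (src, half if src.startswith("v") else 0)
                issue = max(issue, ready.get(key, 0))
            latency = VALU_LATENCY if is_vector else SALU_LATENCY
            ready[(dst, half if is_vector else 0)] = issue + latency
            cycle = issue + 1
    return cycle


if __name__ == "__main__":
    gcn = gcn_issue_cycles(PROGRAM)
    wave32 = rdna_issue_cycles(PROGRAM, halves=1)
    wave64 = rdna_issue_cycles(PROGRAM, halves=2)
    print(gcn, wave32, wave64)               # 16 8 9
    print(round((1 - wave64 / gcn) * 100))   # 44 (percent fewer issue cycles)
```

Under these assumptions the Wave64 case on SIMD32 needs 9 issue cycles against 16 on GCN, which matches the 44% reduction quoted above, and the Wave32 case is the shortest at 8 cycles.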
Compute Unit 0, Compute Unit 1
ACCESS TO 2X Registers
UP TO 2X ALUs
ACCESS TO 4X Cache Bandwidth
New L1 Level Cache
Improved Bandwidth Amplification
Reduced Latency and Power
Reduced Congestion at L2 Level
[Block diagram: the GPU layout (Infinity Fabric; PCIe Gen 4; Display Engine; Multimedia Engine; Geometry Processor; Graphics Command Processor; 4 ACEs; HWS; DMA; 2 Shader Engines; 20 Dual Compute Units; 4 L1 caches; 4 Prim Units; 4 Rasterizers; 16 RBs; 16 L2 slices; four 64-bit Memory Controllers) with 20 L0 caches shown]
Unified LLC for GFX/ACE Pipes
Instruction Range Based Actions
OOO between R/W, L0, L1, L2, Mem
Reduced Latency and Power
Reduced Data Movement
[Diagram: 20 WGPs]
[Diagram labels: SGPR; Wave Buffers; SIMD 0 VGPR; SIMD 1 VGPR; LDS]
[Diagram labels: Shader Complex; PCIe® 4.0; Geometry; Async Compute; Command Interfaces; Compressed Data; SOC Fabric; Rasterizer & L1; Texture; Display Engine]
7nm "Navi" GPU - A GPU Built For Performance
[Chart: axis from 0% to 100%]
See Endnotes RX-325 and RX-362. Data based on AMD internal testing 6/1/2019.
See Endnotes RX-325, RX-358, and RX-365.
Slide data based on AMD internal testing 6/1/2019.
14 nm “Vega64” 7nm “Navi”
[Trace legend: Wave Id; Vector Instruction Issue; SFU Instruction; Scalar Mem Instruction; Store Results; Waiting to be executed]
One SIMD instruction trace of oldest wave (12), next to oldest wave (13), etc.