Introducing the Versal Architecture
Presented By
Sumit Shah
Director of Silicon Product Marketing and Management
October 2, 2018
40 Years of Processor Performance
[Figure: processor performance vs. VAX-11/780 on a log scale (10 to 100,000), 1980 to 2015. Era labels: CISC, RISC, End of Dennard Scaling, Amdahl's Law, "?". Growth-rate annotations: 2X / 3.5 years, 2X / 1.5 years, 2X / 3.5 years, 2X / 6 years.]
[Figure: workload categories: Safety Processing or Latency-Critical Workloads; Irregular data types, instruction sets, data operation; Domain Specific Parallelism (e.g., Video, ML); Whole Application; Sensor Fusion, Pre-Processing, Data Aggregation; Complex Algorithms, Full Linux “Services”.]
Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e 2018
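The "2X / N years" annotations on the performance chart can be restated as compound annual growth rates. The sketch below assumes steady exponential growth within each era; the function name `annual_growth` is illustrative and not taken from the source.

```python
def annual_growth(doubling_years: float) -> float:
    """Compound annual growth rate implied by doubling every `doubling_years` years."""
    # If performance doubles in T years, one year multiplies it by 2 ** (1 / T).
    return 2.0 ** (1.0 / doubling_years) - 1.0


# Doubling periods quoted on the chart.
for years in (1.5, 3.5, 6.0):
    print(f"2X / {years} years is about {annual_growth(years):.1%} per year")
```

Under that assumption, doubling every 1.5, 3.5, and 6 years works out to roughly 59%, 22%, and 12% per year respectively.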
© Copyright 2018 Xilinx
Need for a New Programming Paradigm
Need a Scalable, Unified Platform
Ecosystem of Libraries
Software Developer: Needs Agility and Abstraction
Hardware Developer: Needs Flexibility to Optimize for Performance/Power
Modify, Design, Add Code
New Device Category: Adaptive Compute Acceleration Platform
ADAPTIVE COMPUTE ACCELERATION PLATFORM
Diverse Workloads in Milliseconds
Development Tools, HW/SW Libraries, Run-time Stack
˃ Heterogeneous Acceleration
˃ For Any Application
˃ For Any Developer
Breakthrough Performance for Cloud, Network, and Edge
Scalar Engines
• Platform Control
• Edge Compute
Intelligent Engines
• AI Compute
• Diverse DSP workloads
Protocol Engines
• Integrated 600G cores
• 4X encrypted bandwidth
Network-on-Chip
• Guaranteed Bandwidth
• Enables SW Programmability
Platform Management Controller
Bringing the Platform to Life & Keeping it Safe & Secure
Boot & Configuration
˃ Boots the platform in milliseconds (any engine first)
˃ 8X faster dynamic reconfiguration
[Figure labels: Boot, DONE, 10s of Milliseconds]
Integrated Platform Interfaces & High Speed Debug
˃ Integrated flash, system & debug interfaces
˃ High-speed non-invasive, chip-wide debug
[Figure labels: BOOT & CONFIG, SAFETY, SECURITY, DEBUG]
A Processor in Every Device
Diverse Use Models for Scalar Processing
The Arm Subsystem
Adaptable Hardware Engines
Greater Compute Density for Any Workload
Re-Architected Hardware Fabric
˃ 4X density per logic block for more compute
˃ Less external routing → greater performance
˃ Code and IP compatible with 16nm devices
Tune for Power & Performance
˃ Three operating voltages to choose from
˃ Balance power/performance for target app
˃ Equivalent to 3 speed grades in one device
[Figure: VLOW: 30% Lower Power; VHIGH: 20% More Performance]
Adaptable to any Workload
˃ Bit-level precision (1 → 1,000) for any algorithm
˃ Improves ML efficiency (compression, pruning)
˃ Forward-compatible to lower precision neural networks, e.g., BNN
Intelligent Engines
Intelligent Engines for Diverse Compute
DSP Engines
High-precision floating point & low latency
Granular control for customized data paths
AI Engines
High throughput, low latency, and power efficient
Ideal for AI inference and advanced signal processing
DSP Engines
Versatility and Granular Control of Data Path
Intelligent Engines
Massive AI Inference Throughput and Wireless Compute
[Figure: array of AI Core tiles, each paired with a local Memory block]
Massive array of interconnected cores
˃ Instantiate multiple tiles (10s to 100s) for scalable compute
Terabytes/sec of interface bandwidth to other engines
˃ Direct, massive throughput to adaptable HW engines
˃ Implement core application with AI for “Whole App Acceleration”
NoC for Ease of Use, Guaranteed Bandwidth, and Power Efficiency
Adaptable Memory Hierarchy
The Right Memory for the Right Job
[Figure: memory hierarchy ordered by increasing bandwidth and decreasing density; bandwidth scale marked at 1 Tb/s, 10 Tb/s, 100 Tb/s, and 1,000 Tb/s]
LUTRAM: Distributed low-latency memory
Block RAM & UltraRAM: Embedded configurable SRAM
(New) Accelerator RAM: 4 MB sharable across engines
HBM: In-package DRAM
DDR External Memory: DDR4-3200; LPDDR4-4266
[Other figure labels: local data memory in AI engines; Cache; TCM; OCM; Arm Cortex-R5; WORKLOAD N; MIPI; PCIe & Network; CCIX Cores; SerDes; LVDS; GPIO]
Introducing the “Integrated Shell”
‘Shell’: Pre-Built Core Infrastructure & System Connectivity
˃ External host interface
[Figure labels: Scalar Engines; Adaptable Engines; Intelligent Engines; Arm; Dual-Core; AI]
Transceivers: Robust and Scalable Connectivity
Programmable I/O for Any Sensor, Interface, or Memory
3.2Gb/s DDR4
Server Class Density Per Channel
Versal Core Series Enables “Smart Cities”
Video Surveillance with Machine Learning
For Any Developer
Embedded Developers: Embedded Run-Time
Hardware Developers: Vivado Design Suite
AI Core Series
Prime Series
DEVICE CATEGORY: FPGA, SoC, ACAP
Announcing the First Two Series of the Versal Portfolio
AI Core Series
Breakthrough AI Inference Throughput
˃ Portfolio’s highest compute and low latency inference
˃ Optimized for cloud, networking, & autonomous applications
˃ For highest range of AI and workload acceleration
[Figure labels: AI RF Series; AI Core Series; AI Edge Series]
Versal AI Core Series
Highest AI Inference Throughput
50 – 150 INT8 TOPs
Scalable DDR: 128b – 256b w/ECC
[Figure callout: First Available Device]

| Resource | VC1352 | VC1502 | VC1702 | VC1802 | VC1902 |
|---|---|---|---|---|---|
| Intelligent Engines: AI Engines | 128 | 217 | 310 | 300 | 400 |
| Intelligent Engines: DSP Engines | 928 | 1,312 | 1,272 | 1,600 | 1,968 |
| Adaptable Engines: System Logic Cells (K) | 540 | 797 | 1,021 | 1,586 | 1,968 |
| Accelerator RAM (Mb) | 32 | 0 | 32 | 0 | 0 |
Versal Prime Series
6X Scalable Logic Density
[Figure callout: First Available Device]

| Resource | VM1102 | VM1302 | VM1402 | VM1502 | VM1802 | VM2502 | VM2602 | VM2702 | VM2902 |
|---|---|---|---|---|---|---|---|---|---|
| Intelligent Engines: DSP Engines | 472 | 736 | 1,504 | 1,312 | 1,968 | 3,984 | 1,880 | 2,500 | 3,080 |
| Adaptable Engines: System Logic Cells (K) | 352 | 572 | 1,002 | 797 | 1,968 | 2,030 | 1,263 | 1,805 | 2,154 |
| Foundational Platform: NoC Master / NoC Slave Ports | 5 | 16 | 16 | 14 | 28 | 28 | 16 | 26 | 26 |
| Foundational Platform: DDR Bus Widths | 64 | 128 | 256 | 128 | 256 | 288 | 384 | 384 | 384 |
| Foundational Platform: CCIX & PCIe® w/DMA (CPM) | - | - | - | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX |
| Foundational Platform: PCI Express® | 1 x Gen4x8 | 2 x Gen4x8 | 2 x Gen4x8 | 4 x Gen4x8 | 4 x Gen4x8 | 1 x Gen4x8 | 1 x Gen4x8 | 2 x Gen4x8 | 2 x Gen4x8 |
| Foundational Platform: Transceiver | 12 | 24 | 24 | 44 | 44 | 44 | 52 | 76 | 92 |

Real-time Processing Unit (all devices): Dual-core Arm Cortex-R5, 32KB/32KB L1 Cache, 256KB TCM w/ECC and 256KB OCM w/ECC
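As a rough illustration of what the DDR Bus Widths row implies, the sketch below converts a bus width into theoretical peak bandwidth at the DDR4-3200 rate quoted earlier in the deck. It assumes every listed bit is a data bit running at 3200 MT/s, so if the listed widths include ECC bits the result is an overestimate. The function name is hypothetical and sustained bandwidth will be lower in practice.

```python
DDR4_3200_MTPS = 3200  # mega-transfers per second per data pin (DDR4-3200)


def peak_ddr_gbytes_per_s(bus_width_bits: int, mtps: int = DDR4_3200_MTPS) -> float:
    """Theoretical peak bandwidth in GB/s for a DDR interface of the given width."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mtps / 1000  # MB/s -> GB/s


# Smallest and largest bus widths listed in the table.
for width in (64, 384):
    print(f"{width}-bit bus: {peak_ddr_gbytes_per_s(width):.1f} GB/s peak")
```

Under these assumptions a 64-bit interface peaks at 25.6 GB/s and a 384-bit interface at 153.6 GB/s.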
Versal Roadmap
HBM: Memory Integration
Premium: 112G Serdes, 600G Cores
AI RF: AI w/ Integrated RF
AI Core: AI Inference Throughput
AI Edge: Lowest power AI
Prime: Broadest Application
Getting Started
Visit www.xilinx.com/versal
Check out the Media Kit
Watch ACAP Intro video
Subscribe to mailing list for the latest news
Key Take-Aways
Availability
Early Access Program for SW and tools
Devices Available 2H 2019
Versal Prime Series
Transceiver (32G, 58G)

| Device | VM1102 | VM1302 | VM1402 | VM1502 | VM1802 | VM2502 | VM2602 | VM2702 | VM2902 |
|---|---|---|---|---|---|---|---|---|---|
| 32G Transceivers | 12 | 24 | 24 | 44 | 44 | 16 | 20 | 32 | 40 |
| 58G Transceivers | 0 | 0 | 0 | 0 | 0 | 28 | 32 | 44 | 52 |
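The Transceiver (32G, 58G) counts can be turned into a rough aggregate line rate per device. The sketch below assumes "32G" and "58G" are nominal per-lane rates in Gb/s, counts one direction only, and ignores encoding and protocol overhead. The function name is hypothetical.

```python
RATE_32G_GBPS = 32  # assumed nominal rate per "32G" transceiver
RATE_58G_GBPS = 58  # assumed nominal rate per "58G" transceiver


def aggregate_gbps(count_32g: int, count_58g: int) -> int:
    """Raw one-direction aggregate line rate in Gb/s across all transceivers."""
    return count_32g * RATE_32G_GBPS + count_58g * RATE_58G_GBPS


# Transceiver counts for the smallest and largest Prime devices in the table.
print("VM1102:", aggregate_gbps(12, 0), "Gb/s")
print("VM2902:", aggregate_gbps(40, 52), "Gb/s")
```

With those assumptions the VM1102 comes to 384 Gb/s and the VM2902 to 4,296 Gb/s.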
Versal AI Core Series

| Resource | VC1352 | VC1502 | VC1702 | VC1802 | VC1902 |
|---|---|---|---|---|---|
| Adaptable Engines: System Logic Cells (K) | 540 | 797 | 1,021 | 1,586 | 1,968 |
| Adaptable Engines: LUTs | 246,784 | 364,544 | 466,688 | 725,000 | 899,840 |
| Memory: Distributed RAM (Mb) | 8 | 11 | 14 | 22 | 27 |
| Memory: Total Block RAM (Mb) | 18 | 19 | 29 | 28 | 34 |
| Memory: UltraRAM (Mb) | 42 | 60 | 113 | 91 | 130 |
| Memory: Accelerator RAM (Mb) | 32 | 0 | 32 | 0 | 0 |
| Memory: Total SRAM Capacity (Mb) | 92 | 80 | 174 | 120 | 164 |

Scalar Engines (all devices)
Application Processing Unit: Dual-core Arm® Cortex-A72, 48KB/32KB L1 Cache w/ parity & ECC; 1MB L2 Cache w/ ECC
Real-time Processing Unit: Dual-core Arm Cortex-R5, 32KB/32KB L1 Cache, and 256KB TCM w/ECC
Memory: 256KB On-Chip Memory w/ECC
Connectivity: Ethernet (x2); UART (x2); CAN-FD (x2); USB 2.0 (x1); SPI (x2); I2C (x2)

Versal Prime Series

| Resource | VM1102 | VM1302 | VM1402 | VM1502 | VM1802 | VM2502 | VM2602 | VM2702 | VM2902 |
|---|---|---|---|---|---|---|---|---|---|
| Intelligent Engines: DSP Engines | 472 | 736 | 1,504 | 1,312 | 1,968 | 3,984 | 1,880 | 2,500 | 3,080 |
| Adaptable Engines: System Logic Cells (K) | 352 | 572 | 1,002 | 797 | 1,968 | 2,030 | 1,263 | 1,805 | 2,154 |
| Adaptable Engines: LUTs | 161,024 | 261,376 | 457,984 | 364,544 | 899,840 | 927,872 | 577,536 | 825,000 | 984,576 |
| Memory: Distributed RAM (Mb) | 5 | 8 | 14 | 11 | 27 | 28 | 18 | 25 | 30 |
| Memory: Total Block RAM (Mb) | 8 | 16 | 40 | 19 | 34 | 48 | 55 | 74 | 90 |
| Memory: Total UltraRAM (Mb) | 27 | 47 | 47 | 60 | 130 | 197 | 119 | 169 | 204 |
| Memory: Total SRAM Capacity (Mb) | 35 | 63 | 87 | 80 | 164 | 245 | 174 | 243 | 294 |
| DDR Bus Widths | 64 | 128 | 256 | 128 | 256 | 288 | 384 | 384 | 384 |
| DDR Memory Controllers | 1 | 2 | 4 | 2 | 4 | 5 | 6 | 6 | 6 |
| CCIX & PCIe® w/DMA (CPM) | - | - | - | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX | 1 x Gen4x16, CCIX |
| PCI Express® | 1 x Gen4x8 | 2 x Gen4x8 | 2 x Gen4x8 | 4 x Gen4x8 | 4 x Gen4x8 | 1 x Gen4x8 | 1 x Gen4x8 | 2 x Gen4x8 | 2 x Gen4x8 |
| Multirate Ethernet MAC | 1 | 2 | 2 | 4 | 4 | 1 | 2 | 2 | 2 |

Scalar Engines (all devices)
Application Processing Unit: Dual-core Arm® Cortex-A72, 48KB/32KB L1 Cache w/ parity & ECC; 1MB L2 Cache w/ ECC
Real-time Processing Unit: Dual-core Arm Cortex-R5, 32KB/32KB L1 Cache, and 256KB TCM w/ECC
Memory: 256KB On-Chip Memory w/ECC
Connectivity: Ethernet (x2); USB 2.0 (x1); UART (x2); SPI (x2); I2C (x2); CAN-FD (x2)

Packages (each entry lists XPIO, HDIO, MIO, GTY, GTM; entries in source order)

| Package Footprint | Package Dimensions | XPIO, HDIO, MIO, GTY, GTM |
|---|---|---|
| B1024 | 31x31 | 216, 22, 78, 12, 0; 216, 44, 78, 16, 0; 324, 44, 78, 16, 0 |
| B1369 | 35x35 | 216, 44, 78, 24, 0; 324, 44, 78, 24, 0; 324, 44, 78, 24, 0 |
| A1760 | 40x40 | 432, 22, 78, 24, 0; 648, 22, 78, 24, 0; 756, 22, 78, 20, 0 |
| C1760 | 40x40 | 378, 44, 78, 44, 0; 378, 44, 78, 44, 0; 378, 44, 78, 20, 32; 378, 44, 78, 24, 32; 378, 44, 78, 24, 32 |
| A2197 | 45x45 | 648, 44, 78, 44, 0; 648, 44, 78, 16, 16 |
| A2785 | 50x50 | 702, 44, 78, 16, 28; 702, 44, 78, 20, 32; 702, 44, 78, 32, 44; 702, 44, 78, 40, 52 |
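The AI Core memory rows above can be cross-checked with simple arithmetic. From the numbers alone, Total SRAM Capacity appears to equal Block RAM plus UltraRAM plus Accelerator RAM, with Distributed RAM not counted. This is an observation from the figures, not something the slides state, and the dictionary layout and function name below are illustrative.

```python
# Memory rows (Mb) for the AI Core series, copied from the table above.
MEMORY_MB = {
    "VC1352": {"block": 18, "ultra": 42, "accel": 32, "total": 92},
    "VC1502": {"block": 19, "ultra": 60, "accel": 0, "total": 80},
    "VC1702": {"block": 29, "ultra": 113, "accel": 32, "total": 174},
    "VC1802": {"block": 28, "ultra": 91, "accel": 0, "total": 120},
    "VC1902": {"block": 34, "ultra": 130, "accel": 0, "total": 164},
}


def total_sram_gap(device: str) -> int:
    """Listed Total SRAM minus (Block RAM + UltraRAM + Accelerator RAM), in Mb."""
    row = MEMORY_MB[device]
    return row["total"] - (row["block"] + row["ultra"] + row["accel"])


for device in MEMORY_MB:
    print(device, total_sram_gap(device))
```

Every gap is 0 or 1 Mb, which is consistent with each row having been rounded to whole megabits independently.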