Accelerating ML Recommendation
Roger Espasa, Nivard Aymerich, Allen Baum, Tom Berg, Jim Burr, Eric Hao, Jayesh Iyer, Miquel Izquierdo, Shankar Jayaratnam,
Darren Jones, Chris Klingner, Jin Kim, Stephen Lee, Marc Lupon, Grigorios Magklis, Bojan Maric, Rajib Nath, Mike Neilly, Duane
Northcutt, Bill Orner, Jose Renau, Gerard Reves, Xavier Reves, Tom Riordan, Pedro Sanchez, Sri Samudrala, Guillem Sole,
Raymond Tang, Tommy Thorn, Francisco Torres, Sebastia Tortella, Daniel Yau
The ET-SoC-1 has over a thousand RISC-V processors on a single TSMC 7nm chip, including:
• 1088 energy-efficient ET-Minion 64-bit RISC-V in-order cores each with a vector/tensor unit
• 4 high-performance ET-Maxion 64-bit RISC-V out-of-order cores
• >160 million bytes of on-chip SRAM
• Interfaces for large external memory with low-power LPDDR4x DRAM and eMMC FLASH
• PCIe x8 Gen4 and other common I/O interfaces
• Innovative low-power architecture and circuit techniques allow the entire chip to
• Compute at peak rates of 100 to 200 TOPS
• Operate using under 20 watts for ML recommendation workloads
This general-purpose parallel-processing system on a chip can be used for many parallelizable workloads
But today, we want to show why it is a compelling solution for Machine Learning Recommendation (inference)
• ML Recommendation is one of the hardest and most important problems for many hyperscale data centers
2
Requirements and challenges for ML Recommendation in large datacenters
Most inferencing workloads for recommendation systems in large data centers are run on x86 servers
Often these servers have an available slot for an accelerator card, but it needs to meet key requirements:
• 100 TOPS to 1000 TOPS peak rates to provide better performance than the x86 host CPU alone
• Limited power budget per card, perhaps 75 to 120 watts, and must be air-cooled[1]
• Strong support for Int8, but must also support FP16 and FP32 data types[1,2]
• ~100 GB of memory capacity on the accelerator card to hold most embeddings, weights and activations[3]
• ~100 MB of on-die memory[5]
• Handle both dense and sparse compute workloads. Embedding look-up is a sparse-matrix-by-dense-matrix multiplication[5]
• Be programmable to deal with rapidly evolving workloads[1], rather than depending on overly-specialized
hardware[4,5]
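The sparse-by-dense requirement above can be made concrete with a small sketch (table sizes and index values here are invented for illustration): an embedding look-up gathers a handful of rows from a large dense table, which is mathematically a sparse one-hot matrix multiplied by a dense matrix.

```python
import numpy as np

# Illustrative sketch: an embedding look-up is equivalent to multiplying a
# sparse one-hot matrix by a dense embedding table, which is why it is
# described as sparse-matrix-by-dense-matrix multiplication.
rng = np.random.default_rng(0)
table = rng.standard_normal((1000, 64))   # hypothetical table: 1000 rows, 64-dim embeddings

ids = np.array([3, 17, 999])              # sparse indices for one query

# Gather form (what the hardware actually does): scattered reads into a big table.
gathered = table[ids]

# Equivalent sparse-by-dense matrix-product form.
onehot = np.zeros((len(ids), 1000))
onehot[np.arange(len(ids)), ids] = 1.0
assert np.allclose(onehot @ table, gathered)
```

The gather form shows why this workload stresses memory latency rather than arithmetic: the reads land at effectively random addresses in a table far larger than on-chip SRAM.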
3
Esperanto’s approach is different... and we think better for ML Recommendation
• One giant hot chip uses up the power budget → Use multiple low-power chips that still fit within the power budget
• Limited I/O pin budget limits memory bandwidth → Performance, pins, memory, and bandwidth scale up with more chips
• Dependence on systolic-array multipliers → Thousands of general-purpose RISC-V/tensor cores
  • Great for a high ResNet-50 score → Far more programmable than overly-specialized (e.g. systolic) hardware
  • Not so good with large sparse memory → Thousands of threads help hide large-sparse-memory latency
• Only a handful (10–20) of CPU cores, with limited parallelism when the problem doesn’t fit onto the array multiplier → Full parallelism of thousands of cores always available
• Standard-voltage operation is not energy-efficient → Low-voltage transistor operation is more energy-efficient and also reduces power, though it requires both circuit and architecture innovations
Challenge: How to put the highest ML Recommendation performance
onto a single accelerator card with a 120-watt limit?
4
Could fit six chips on 120W card, if each took less than 20 watts
Assumed half of 20W power for 1K RISC-V cores, so only 10 mW per core!
[Figure: energy efficiency vs. operating voltage (0.2 V to 0.9 V) for the 1K ET-Minion RISC-V/Tensor cores; a single chip run at standard voltage would draw roughly 275 W]
6
ET-Minion is an Energy-Efficient RISC-V CPU with a Vector/Tensor Unit
ET-MINION IS A CUSTOM-BUILT 64-BIT RISC-V PROCESSOR
• In-order pipeline with low gates/stage to improve MHz at low voltages
• Architecture and circuits optimized to enable low-voltage operation
• Two hardware threads of execution
• Software-configurable L1 data cache and/or scratchpad

ML-OPTIMIZED VECTOR/TENSOR UNIT
• 512-bit-wide integer datapath per cycle
• 128 8-bit integer operations per cycle, accumulating to 32-bit Int
• 16 32-bit single-precision operations per cycle
• 32 16-bit half-precision operations per cycle
• New multi-cycle tensor instructions (including D-tags)
  • Can run for up to 512 cycles (up to 64K operations) with one tensor instruction
  • Reduces instruction-fetch bandwidth and reduces power
  • RISC-V integer pipeline put to sleep during tensor instructions
• Vector transcendental instructions optimized for low-voltage operation to improve energy efficiency

OPERATING RANGE: 300 MHz TO 2 GHz

[Figure: ET-Minion RISC-V core and vector/tensor unit — front end, RISC-V integer pipeline, data-cache control, four-bank L1 data cache/scratchpad, and vector lanes with 32-bit & 16-bit FMAs, TIMAs, VPU register files, and T0/T1 transcendental ROMs]

Optimized for energy-efficient ML operations. Each ET-Minion can deliver a peak of 128 Int8 GOPS per GHz.
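The per-core figure above, combined with the 1088-core count from the chip overview, lets us sanity-check the chip-level peak rates (the clock points below are illustrative, not measured operating points):

```python
# Back-of-the-envelope peak-throughput check from the per-core figures.
INT8_OPS_PER_CYCLE = 128   # per ET-Minion vector/tensor unit
MINIONS = 1088             # ET-Minion cores per ET-SoC-1

def peak_int8_tops(freq_ghz):
    """Whole-chip peak Int8 TOPS at a given ET-Minion clock."""
    return MINIONS * INT8_OPS_PER_CYCLE * freq_ghz * 1e9 / 1e12

print(peak_int8_tops(1.0))   # ~139 TOPS at 1 GHz
print(peak_int8_tops(1.5))   # ~209 TOPS at 1.5 GHz
```

These two points bracket the "100 to 200 TOPS" peak range quoted on the opening slide.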
7
8 ET-Minions form a “Neighborhood”
• Eight ET-Minions share a 32 KB instruction cache
• Takes advantage of the fact that the cores are almost always running highly parallel code

[Figure: eight ET-Minion cores sharing one 32 KB instruction cache]
8
32 ET-Minion CPUs and 4 MB Memory form a “Minion Shire”
SOFTWARE-CONFIGURABLE MEMORY HIERARCHY
• L1 data cache can also be configured as scratchpad
• Four 1 MB SRAM banks can be partitioned as private L2, shared L3, and scratchpad

SHIRES CONNECTED WITH MESH NETWORK

[Figure: Minion Shire — 32 ET-Minions (m0–m31) and four 1 MB SRAM banks connected through a crossbar to a 4x4-mesh stop]
9
Shires are connected to each other and to external memory through Mesh Network
10
ET-SoC-1: Full chip internal block diagram
34 MINION SHIRES
• 1088 ET-Minions

8 MEMORY SHIRES
• 16-bit LPDDR4x DRAM controllers (16 channels per chip)

160 million bytes of on-die SRAM

PCIe SHIRE
• 4 ET-Maxions
• x8 PCIe Gen4
• PCIe logic, DFT/eFuses
• Secure Root of Trust

EXTERNAL IO
• SMBus
• Serial – I2C/SPI/UART
• GPIO
• FLASH

[Figure: full-chip block diagram — the array of Minion Shires bordered by Memory Shires with their 16-bit LPDDR4x controllers, and by the PCIe Shire containing the four ET-Maxions]
12
Card with six ET-SOC-1 chips for large sparse ML Recommendation models
• Esperanto’s low-power technology allows six Esperanto chips and 24 DRAM chips to fit into the 120-watt power budget of the customer’s PCIe card
• A single ML model on one accelerator card can use up to 192 GB of low-cost LPDDR4x DRAM with up to 822 GB/s of memory bandwidth
• Over 6K cores with 12K threads handle memory latency across 96 memory channels and perform well for ML Recommendation (and other) tasks
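The card-level bandwidth figure is consistent with the channel count. The LPDDR4x-4266 data rate below is an assumption on our part that reproduces the quoted number; the slide itself states only the channel count and the total:

```python
# Sketch of the card-level memory-bandwidth arithmetic.
CHANNELS = 96             # 16 channels per chip x 6 chips
BYTES_PER_TRANSFER = 2    # each channel is 16 bits wide
MTS = 4266e6              # assumed LPDDR4x-4266 transfers/second (not stated on the slide)

bandwidth_gb_s = CHANNELS * BYTES_PER_TRANSFER * MTS / 1e9
print(round(bandwidth_gb_s))  # ~819 GB/s, close to the quoted 822 GB/s
```

Similarly, 192 GB across 24 DRAM chips works out to 8 GB per DRAM package, a standard LPDDR4x density.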
[Figure: card block diagram — six ET-SoC-1 chips, each attached to four groups of four 16-bit LPDDR4x channels (96 16-bit channels in total), connected through a PCIe switch to the PCIe card interface]
13
Six ET-SoC-1 chips fit on an OCP Glacier Point v2 Card
Peak performance > 800 Int8 TOPS when all ET-Minions on six chips are operating at 1 GHz
14
ET-SoC-1 can be deployed at scale in existing OCP Data Centers
• OCP Glacier Point v2 accelerator card holds: 6 ET-SoC-1 chips
• Yosemite v2 sled holds: 2 Glacier Point v2 cards → 12 ET-SoC-1 chips
• Yosemite v2 cubby holds: 4 Yosemite sleds → 8 accelerator cards, 48 ET-SoC-1 chips
• Rack with Yosemite v2 holds: 8 Yosemite v2 cubbies → 64 accelerator cards, 384 ET-SoC-1 chips
• Example OCP data center: at 30 sq. ft. per OCP rack [7] and an estimated 4K–20K racks per data center → millions of ET-SoC-1 chips

[Figure: scale-out path from one Glacier Point v2 card (x2) to a Yosemite v2 sled (x4) to a cubby (x8) to a rack with top-of-rack switch and power shelves, to an example OCP data center]
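The per-rack totals follow directly from the multipliers in the list above:

```python
# Scaling arithmetic behind the OCP deployment slide.
CHIPS_PER_CARD = 6     # ET-SoC-1 chips per Glacier Point v2 card
CARDS_PER_SLED = 2     # Glacier Point v2 cards per Yosemite v2 sled
SLEDS_PER_CUBBY = 4    # sleds per Yosemite v2 cubby
CUBBIES_PER_RACK = 8   # cubbies per rack

cards_per_rack = CARDS_PER_SLED * SLEDS_PER_CUBBY * CUBBIES_PER_RACK
chips_per_rack = cards_per_rack * CHIPS_PER_CARD
print(cards_per_rack, chips_per_rack)  # 64 cards, 384 chips per rack
```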
15
Software: Esperanto Supports C++ / PyTorch and Common ML Frameworks
16
ML Recommendation performance per card comparisons
Based on MLPerf Deep Learning Recommendation Model benchmark [8]
The Esperanto card delivers 123x better performance per watt than the 8-socket Intel baseline:

• Intel Xeon Platinum 8380H (8S): 24,630 samples/sec with 8 die [9a.1]; 2000 watts for 8 die; 3,079 samples/sec per die at 250 watts per die [9a.2]
• NVIDIA T4: 665,646 samples/sec with 20 T4 cards [9b]; 1400 watts for 20 T4; 33,282 samples/sec per T4 card at 70 watts per card
• NVIDIA A10: 772,378 samples/sec with 8 A10 cards [9c]; 1200 watts for 8 A10; 96,547 samples/sec per A10 card at 150 watts per card
• Esperanto ET-SoC-1 x6: 182,418 samples/sec with 1 card (6 chips) [9d]; 120 watts; 182,418 samples/sec per card at 120 watts per card

MLPerf DLRM results for Intel and NVIDIA from MLCommons website [8], power numbers from data sheets. Esperanto estimates [9d].
17
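The 123x headline is the ratio of samples per second per watt, computed from the per-system totals on this slide:

```python
# Performance-per-watt ratio behind the "123x" claim.
esperanto_perf_per_watt = 182_418 / 120   # samples/sec/W, one 6-chip card
intel_8s_perf_per_watt = 24_630 / 2000    # samples/sec/W, 8-socket 8380H

print(round(esperanto_perf_per_watt / intel_8s_perf_per_watt))  # 123
```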
Image Classification performance per card comparisons
Based on ResNet-50 benchmark [10]

[Bar chart: relative performance (inferences/second) per card, with the highest bar at 25.7x]
ResNet-50 numbers taken from respective company websites, power numbers from data sheets. Esperanto estimates
18
Four ET-Maxions: High-Performance Out-of-Order RISC-V Processors
20
Summary
The Esperanto ET-SoC-1 is the highest performance commercial RISC-V chip to date
• More RISC-V cores on a single chip
• More RISC-V aggregate instructions per second on a single chip
• Highest TOPS driven by RISC-V cores
Esperanto’s low-voltage technology provides differentiated RISC-V processors with the best performance per watt
• Energy efficiency matters!
• Best performance per watt delivers the best performance in a fixed number of watts
• Solution delivers energy efficient acceleration for datacenter inference workloads, especially recommendation
Early Access Program for qualified customers beginning later in 2021 (for info, contact: [email protected])
21
Thanks to our Key Development Partners
Thanks to all our partners for their help in bringing our vision into reality! Sorry we can’t name
everyone!
22
Footnotes and References
[1] N. Jouppi, et al., Ten Lessons from Three Generations Shaped Google’s TPUv4i, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture. Page 4 lesson 5 concludes “Inference
DSAs need air cooling for global scale”.
[2] J. Park, et al., Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications, arXiv:1811.09886v2, 29 November 2018.
[3] M. Anderson, et al., First Generation Inference Accelerator Deployment at Facebook, arXiv: 2107.04140v1, 8 Jul 2021
[4] M. Smelyanskiy, Facebook, Linley Fall Processor Conference 2019 “Challenges and Opportunities of Architecting AI Systems at Datacenter Scale”
[5] M. Smelyanskiy, AI System Co-Design: How to Balance Performance & Flexibility, AI Hardware Summit, September 17, 2019. Slide 12 indicated desired Inference characteristics with 100 TOPs of INT8,
100 MB of SRAM. Slide 19 talks about the need for programmability over fixed function hardware.
[6] Note that a core optimized for high voltage and high frequency (2-3GHz) operation will require higher power gate drive strengths to overcome wire delays and hence will have higher Cdyn than a processor
optimized for low-voltage operation. Each of the power/frequency points shown on this energy efficiency curve therefore represents a different physical design, i.e. not the same silicon, to take this changing
Cdyn into account. Designs were synthesized at high and low voltages to estimate potentially achievable frequencies. Performance at each frequency was estimated using our internal ML Recommendation
benchmark, based on running this benchmark on a full chip hardware emulation engine (Synopsys Zebu system) providing clock level accuracy at a few points and interpolating the other points. The goal was
to understand the shape of the energy efficiency curve to find voltages for best energy efficiency (Inferences/second/watt). Different benchmarks would likely have different curves, though we would expect the
overall shape to be similar. Repeating, this was a design study and does not represent any specific silicon results or design, each point on the curve is a differently synthesized design, though with the same
architecture, i.e., we used the full ET-Minion as the input to be synthesized.
[7] Estimate of 30 square feet per rack comes from “The Case for the Infinite Data Center” – Gartner, Source: Gartner, Data Center Frontier
[8] MLPerf DLRM Inference Data Center v0.7 & v1.0: https://mlcommons.org/en/
[9] Measured by MLPerf DLRM Samples / Second; FP32, Offline scores
Additional source information:
• a.1. Submitter: Intel; MLPerf DLRM score 24,630: Inference Data Center v0.7 ID 0.7-126; Hardware used (1-node-8S-CPX-PyTorch-BF16); BF16; https://mlcommons.org/en/inference-datacenter-07/
• a.2 Intel 8380H Processor TDP Power of 250W from: https://ark.intel.com/content/www/us/en/ark/products/204087/intel-xeon-platinum-8380h-processor-38-5m-cache-2-90-ghz.html
• b. Submitter: NVIDIA; T4 MLPerf DLRM score 665,646: Inference Data Center v0.7 ID 0.7-115; Hardware used (Supermicro 6049GP-TRT-OTO-29 (20x T4, TensorRT)); INT8; https://mlcommons.org/en/inference-datacenter-07/
• c. Submitter: NVIDIA; A10 MLPerf DLRM score 772,378: Inference Data Center v1.0 ID 1.0-54; Hardware used (Supermicro 4029GP-TRT-OTO-28 (8x A10, TensorRT)); INT8; https://mlcommons.org/en/inference-datacenter-10/
• d. Internal estimates by Esperanto for MLPerf DLRM: Inference Data Center v0.7; ET-SOC-1; Unverified result is from Emulated/Simulated pre-silicon projections; INT8; Result not verified by MLCommons™ Association.
[10] Measured by ResNet-50 Images per second (Esperanto INT8 Batch 8, NVIDIA INT8 Batch 8, Habana INT8 Batch 10, Intel INT8 Batch 11)
Additional measurement source information:
• a.1. Intel ResNet-50: https://software.intel.com/content/www/us/en/develop/articles/intel-cpu-outperforms-nvidia-gpu-on-resnet-50-deep-learning-inference.html
• a.2. Intel 9282 has 2 die in the package, CPU TDP power for both die from: https://ark.intel.com/content/www/us/en/ark/products/194146/intel-xeon-platinum-9282-processor-77m-cache-2-60-ghz.html
• b. NVIDIA (T4, A10) ResNet-50: https://developer.nvidia.com/deep-learning-performance-training-inference
• c. Habana ResNet-50: https://habana.ai/wp-content/uploads/2018/09/Goya-Datasheet-HL-10x-Nov14-2018.pdf
• d. Esperanto ResNet-50: Emulated/Simulated projections; INT8
[11] P. Xekalakis and C. Celio, The Esperanto ET-Maxion High Performance Out-of-Order RISC-V Processor, 2018 RISC-V Summit, presentation at https://www.youtube.com/watch?v=NjEslX_-t0Q
[12] Maxion is described in “Esperanto Maxes out RISC-V - High-End Maxion CPU Raises RISC-V Performance Bar”, Microprocessor Report, December 10, 2018.
23