Accelerating ML Recommendation
Roger Espasa, Nivard Aymerich, Allen Baum, Tom Berg, Jim Burr, Eric Hao, Jayesh Iyer, Miquel Izquierdo, Shankar Jayaratnam,
Darren Jones, Chris Klingner, Jin Kim, Stephen Lee, Marc Lupon, Grigorios Magklis, Bojan Maric, Rajib Nath, Mike Neilly, Duane
Northcutt, Bill Orner, Jose Renau, Gerard Reves, Xavier Reves, Tom Riordan, Pedro Sanchez, Sri Samudrala, Guillem Sole,
Raymond Tang, Tommy Thorn, Francisco Torres, Sebastia Tortella, Daniel Yau
The ET-SoC-1 has over a thousand RISC-V processors on a single TSMC 7nm chip, including:
• 1088 energy-efficient ET-Minion 64-bit RISC-V in-order cores each with a vector/tensor unit
• 4 high-performance ET-Maxion 64-bit RISC-V out-of-order cores
• >160 million bytes of on-chip SRAM
• Interfaces for large external memory with low-power LPDDR4x DRAM and eMMC FLASH
• PCIe x8 Gen4 and other common I/O interfaces
• Innovative low-power architecture and circuit techniques allow the entire chip to
• Compute at peak rates of 100 to 200 TOPS
• Operate using under 20 watts for ML recommendation workloads
This general-purpose parallel-processing system on a chip can be used for many parallelizable workloads
But today, we want to show why it is a compelling solution for Machine Learning Recommendation (inference)
• ML Recommendation is one of the hardest and most important problems for many hyperscale data centers
2
Requirements and challenges for ML Recommendation in large datacenters
Most inferencing workloads for recommendation systems in large data centers are run on x86 servers
Often these servers have an available slot for an accelerator card, but it needs to meet key requirements:
• 100 TOPS to 1000 TOPS peak rates to provide better performance than the x86 host CPU alone
• Limited power budget per card, perhaps 75 to 120 watts, and must be air-cooled[1]
• Strong support for Int8, but must also support FP16 and FP32 data types[1,2]
• ~100 GB of memory capacity on the accelerator card to hold most embeddings, weights and activations[3]
• ~100 MB of on-die memory[5]
• Handle both dense and sparse compute workloads. Embedding look-up is a sparse-matrix-by-dense-matrix multiplication[5]
• Be programmable to deal with rapidly evolving workloads[1], rather than depending on overly-specialized
hardware[4,5]
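The sparse-by-dense requirement above can be made concrete with a small sketch (table sizes and index values here are invented for illustration): an embedding look-up gathers a handful of rows from a large dense table, which is mathematically a sparse one-hot matrix multiplied by a dense matrix.

```python
import numpy as np

# Illustrative sketch: an embedding look-up is equivalent to multiplying a
# sparse one-hot matrix by a dense embedding table, which is why it is
# described as sparse-matrix-by-dense-matrix multiplication.
rng = np.random.default_rng(0)
table = rng.standard_normal((1000, 64))   # hypothetical table: 1000 rows, 64-dim embeddings

ids = np.array([3, 17, 999])              # sparse indices for one query

# Gather form (what the hardware actually does): scattered reads into a big table.
gathered = table[ids]

# Equivalent sparse-by-dense matrix-product form.
onehot = np.zeros((len(ids), 1000))
onehot[np.arange(len(ids)), ids] = 1.0
assert np.allclose(onehot @ table, gathered)
```

The gather form shows why this workload stresses memory latency rather than arithmetic: the reads land at effectively random addresses in a table far larger than on-chip SRAM.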
3
Esperanto’s approach is different... and we think better for ML Recommendation
• One giant hot chip uses up the power budget → Use multiple low-power chips that still fit within the power budget
• Limited I/O pin budget limits memory bandwidth → Performance, pins, memory, and bandwidth scale up with more chips
• Dependence on systolic-array multipliers → Thousands of general-purpose RISC-V/tensor cores
  • Great for a high ResNet-50 score → Far more programmable than overly-specialized (e.g. systolic) hardware
  • Not so good with large sparse memory → Thousands of threads help hide large-sparse-memory latency
• Only a handful (10–20) of CPU cores, with limited parallelism when the problem doesn’t fit onto the array multiplier → Full parallelism of thousands of cores always available
• Standard-voltage operation is not energy-efficient → Low-voltage transistor operation is more energy-efficient and also reduces power, though it requires both circuit and architecture innovations
Challenge: How to put the highest ML Recommendation performance
onto a single accelerator card with a 120-watt limit?
4
Could fit six chips on 120W card, if each took less than 20 watts
Assumed half of 20W power for 1K RISC-V cores, so only 10 mW per core!
[Figure: energy efficiency vs. operating voltage (0.2 V to 0.9 V) for the 1K ET-Minion RISC-V/Tensor cores; a single chip run at standard voltage would draw roughly 275 W]
6
ET-Minion is an Energy-Efficient RISC-V CPU with a Vector/Tensor Unit
ET-MINION IS A CUSTOM-BUILT 64-BIT RISC-V PROCESSOR
• In-order pipeline with low gates/stage to improve MHz at low voltages
• Architecture and circuits optimized to enable low-voltage operation
• Two hardware threads of execution
• Software-configurable L1 data cache and/or scratchpad

ML-OPTIMIZED VECTOR/TENSOR UNIT
• 512-bit-wide integer datapath per cycle
• 128 8-bit integer operations per cycle, accumulating to 32-bit Int
• 16 32-bit single-precision operations per cycle
• 32 16-bit half-precision operations per cycle
• New multi-cycle tensor instructions (including D-tags)
  • Can run for up to 512 cycles (up to 64K operations) with one tensor instruction
  • Reduces instruction-fetch bandwidth and reduces power
  • RISC-V integer pipeline put to sleep during tensor instructions
• Vector transcendental instructions optimized for low-voltage operation to improve energy efficiency

OPERATING RANGE: 300 MHz TO 2 GHz

[Figure: ET-Minion RISC-V core and vector/tensor unit — front end, RISC-V integer pipeline, data-cache control, four-bank L1 data cache/scratchpad, and vector lanes with 32-bit & 16-bit FMAs, TIMAs, VPU register files, and T0/T1 transcendental ROMs]

Optimized for energy-efficient ML operations. Each ET-Minion can deliver a peak of 128 Int8 GOPS per GHz.
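The per-core figure above, combined with the 1088-core count from the chip overview, lets us sanity-check the chip-level peak rates (the clock points below are illustrative, not measured operating points):

```python
# Back-of-the-envelope peak-throughput check from the per-core figures.
INT8_OPS_PER_CYCLE = 128   # per ET-Minion vector/tensor unit
MINIONS = 1088             # ET-Minion cores per ET-SoC-1

def peak_int8_tops(freq_ghz):
    """Whole-chip peak Int8 TOPS at a given ET-Minion clock."""
    return MINIONS * INT8_OPS_PER_CYCLE * freq_ghz * 1e9 / 1e12

print(peak_int8_tops(1.0))   # ~139 TOPS at 1 GHz
print(peak_int8_tops(1.5))   # ~209 TOPS at 1.5 GHz
```

These two points bracket the "100 to 200 TOPS" peak range quoted on the opening slide.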
7
8 ET-Minions form a “Neighborhood”
• Eight ET-Minions share a 32 KB instruction cache
• Takes advantage of the fact that the cores are almost always running highly parallel code

[Figure: eight ET-Minion cores sharing one 32 KB instruction cache]
8
32 ET-Minion CPUs and 4 MB Memory form a “Minion Shire”
SOFTWARE-CONFIGURABLE MEMORY HIERARCHY
• L1 data cache can also be configured as scratchpad
• Four 1 MB SRAM banks can be partitioned as private L2, shared L3, and scratchpad

SHIRES CONNECTED WITH MESH NETWORK

[Figure: Minion Shire — 32 ET-Minions (m0–m31) and four 1 MB SRAM banks connected through a crossbar to a 4x4-mesh stop]
9
Shires are connected to each other and to external memory through Mesh Network
10
ET-SoC-1: Full chip internal block diagram
34 MINION SHIRES
• 1088 ET-Minions

8 MEMORY SHIRES
• 16-bit LPDDR4x DRAM controllers (16 channels per chip)

160 million bytes of on-die SRAM

PCIe SHIRE
• 4 ET-Maxions
• x8 PCIe Gen4
• PCIe logic, DFT/eFuses
• Secure Root of Trust

EXTERNAL IO
• SMBus
• Serial – I2C/SPI/UART
• GPIO
• FLASH

[Figure: full-chip block diagram — the array of Minion Shires bordered by Memory Shires with their 16-bit LPDDR4x controllers, and by the PCIe Shire containing the four ET-Maxions]
12
Card with six ET-SOC-1 chips for large sparse ML Recommendation models
• Esperanto’s low-power technology allows six Esperanto chips and 24 DRAM chips to fit into the 120-watt power budget of the customer’s PCIe card
• A single ML model on one accelerator card can use up to 192 GB of low-cost LPDDR4x DRAM with up to 822 GB/s of memory bandwidth
• Over 6K cores with 12K threads handle memory latency across 96 memory channels and perform well for ML Recommendation (and other) tasks
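The card-level bandwidth figure is consistent with the channel count. The LPDDR4x-4266 data rate below is an assumption on our part that reproduces the quoted number; the slide itself states only the channel count and the total:

```python
# Sketch of the card-level memory-bandwidth arithmetic.
CHANNELS = 96             # 16 channels per chip x 6 chips
BYTES_PER_TRANSFER = 2    # each channel is 16 bits wide
MTS = 4266e6              # assumed LPDDR4x-4266 transfers/second (not stated on the slide)

bandwidth_gb_s = CHANNELS * BYTES_PER_TRANSFER * MTS / 1e9
print(round(bandwidth_gb_s))  # ~819 GB/s, close to the quoted 822 GB/s
```

Similarly, 192 GB across 24 DRAM chips works out to 8 GB per DRAM package, a standard LPDDR4x density.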
[Figure: card block diagram — six ET-SoC-1 chips, each attached to four groups of four 16-bit LPDDR4x channels (96 16-bit channels in total), connected through a PCIe switch to the PCIe card interface]
13
Six ET-SoC-1 chips fit on an OCP Glacier Point v2 Card
Peak performance > 800 Int8 TOPS when all ET-Minions on six chips are operating at 1 GHz
14
ET-SoC-1 can be deployed at scale in existing OCP Data Centers
• OCP Glacier Point v2 accelerator card holds: 6 ET-SoC-1 chips
• Yosemite v2 sled holds: 2 Glacier Point v2 cards → 12 ET-SoC-1 chips
• Yosemite v2 cubby holds: 4 Yosemite sleds → 8 accelerator cards, 48 ET-SoC-1 chips
• Rack with Yosemite v2 holds: 8 Yosemite v2 cubbies → 64 accelerator cards, 384 ET-SoC-1 chips
• Example OCP data center: at 30 sq. ft. per OCP rack [7] and an estimated 4K–20K racks per data center → millions of ET-SoC-1 chips

[Figure: scale-out path from one Glacier Point v2 card (x2) to a Yosemite v2 sled (x4) to a cubby (x8) to a rack with top-of-rack switch and power shelves, to an example OCP data center]
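The per-rack totals follow directly from the multipliers in the list above:

```python
# Scaling arithmetic behind the OCP deployment slide.
CHIPS_PER_CARD = 6     # ET-SoC-1 chips per Glacier Point v2 card
CARDS_PER_SLED = 2     # Glacier Point v2 cards per Yosemite v2 sled
SLEDS_PER_CUBBY = 4    # sleds per Yosemite v2 cubby
CUBBIES_PER_RACK = 8   # cubbies per rack

cards_per_rack = CARDS_PER_SLED * SLEDS_PER_CUBBY * CUBBIES_PER_RACK
chips_per_rack = cards_per_rack * CHIPS_PER_CARD
print(cards_per_rack, chips_per_rack)  # 64 cards, 384 chips per rack
```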
15
Software: Esperanto Supports C++ / PyTorch and Common ML Frameworks
16
ML Recommendation performance per card comparisons
Based on MLPerf Deep Learning Recommendation Model benchmark [8]
The Esperanto card delivers 123x better performance per watt than the 8-socket Intel baseline:

• Intel Xeon Platinum 8380H (8S): 24,630 samples/sec with 8 die [9a.1]; 2000 watts for 8 die; 3,079 samples/sec per die at 250 watts per die [9a.2]
• NVIDIA T4: 665,646 samples/sec with 20 T4 cards [9b]; 1400 watts for 20 T4; 33,282 samples/sec per T4 card at 70 watts per card
• NVIDIA A10: 772,378 samples/sec with 8 A10 cards [9c]; 1200 watts for 8 A10; 96,547 samples/sec per A10 card at 150 watts per card
• Esperanto ET-SoC-1 x6: 182,418 samples/sec with 1 card (6 chips) [9d]; 120 watts; 182,418 samples/sec per card at 120 watts per card

MLPerf DLRM results for Intel and NVIDIA from MLCommons website [8], power numbers from data sheets. Esperanto estimates [9d].
17
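The 123x headline is the ratio of samples per second per watt, computed from the per-system totals on this slide:

```python
# Performance-per-watt ratio behind the "123x" claim.
esperanto_perf_per_watt = 182_418 / 120   # samples/sec/W, one 6-chip card
intel_8s_perf_per_watt = 24_630 / 2000    # samples/sec/W, 8-socket 8380H

print(round(esperanto_perf_per_watt / intel_8s_perf_per_watt))  # 123
```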
Image Classification performance per card comparisons
Based on ResNet-50 benchmark [10]

[Bar chart: relative performance (inferences/second) per card, with the highest bar at 25.7x]
ResNet-50 numbers taken from respective company websites, power numbers from data sheets. Esperanto estimates
18
Four ET-Maxions: High-Performance Out-of-Order RISC-V Processors
20
Summary
The Esperanto ET-SoC-1 is the highest performance commercial RISC-V chip to date
• More RISC-V cores on a single chip
• More RISC-V aggregate instructions per second on a single chip
• Highest TOPS driven by RISC-V cores
Esperanto’s low-voltage technology provides differentiated RISC-V processors with the best performance per watt
• Energy efficiency matters!
• Best performance per watt delivers the best performance in a fixed number of watts
• Solution delivers energy efficient acceleration for datacenter inference workloads, especially recommendation
Early Access Program for qualified customers beginning later in 2021 (for info, contact: [email protected])
21
Thanks to our Key Development Partners
Thanks to all our partners for their help in bringing our vision into reality! Sorry we can’t name
everyone!
22
Footnotes and References
[1] N. Jouppi, et al., Ten Lessons from Three Generations Shaped Google’s TPUv4i, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture. Page 4 lesson 5 concludes “Inference
DSAs need air cooling for global scale”.
[2] J. Park, et al., Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications, arXiv:1811.09886v2, 29 November 2018.
[3] M. Anderson, et al., First Generation Inference Accelerator Deployment at Facebook, arXiv: 2107.04140v1, 8 Jul 2021
[4] M. Smelyanskiy, Facebook, Linley Fall Processor Conference 2019 “Challenges and Opportunities of Architecting AI Systems at Datacenter Scale”
[5] M. Smelyanskiy, AI System Co-Design: How to Balance Performance & Flexibility, AI Hardware Summit, September 17, 2019. Slide 12 indicated desired Inference characteristics with 100 TOPs of INT8,
100 MB of SRAM. Slide 19 talks about the need for programmability over fixed function hardware.
[6] Note that a core optimized for high voltage and high frequency (2-3GHz) operation will require higher power gate drive strengths to overcome wire delays and hence will have higher Cdyn than a processor
optimized for low-voltage operation. Each of the power/frequency points shown on this energy efficiency curve therefore represents a different physical design, i.e. not the same silicon, to take this changing
Cdyn into account. Designs were synthesized at high and low voltages to estimate potentially achievable frequencies. Performance at each frequency was estimated using our internal ML Recommendation
benchmark, based on running this benchmark on a full chip hardware emulation engine (Synopsys Zebu system) providing clock level accuracy at a few points and interpolating the other points. The goal was
to understand the shape of the energy efficiency curve to find voltages for best energy efficiency (Inferences/second/watt). Different benchmarks would likely have different curves, though we would expect the
overall shape to be similar. Repeating, this was a design study and does not represent any specific silicon results or design, each point on the curve is a differently synthesized design, though with the same
architecture, i.e., we used the full ET-Minion as the input to be synthesized.
[7] Estimate of 30 square feet per rack comes from “The Case for the Infinite Data Center” – Gartner, Source: Gartner, Data Center Frontier
[8] MLPerf DLRM Inference Data Center v0.7 & v1.0: https://mlcommons.org/en/
[9] Measured by MLPerf DLRM Samples / Second; FP32, Offline scores
Additional source information:
• a.1. Submitter: Intel; MLPerf DLRM score 24,630: Inference Data Center v0.7 ID 0.7-126; Hardware used (1-node-8S-CPX-PyTorch-BF16); BF16; https://mlcommons.org/en/inference-datacenter-07/
• a.2 Intel 8380H Processor TDP Power of 250W from: https://ark.intel.com/content/www/us/en/ark/products/204087/intel-xeon-platinum-8380h-processor-38-5m-cache-2-90-ghz.html
• b. Submitter: NVIDIA; T4 MLPerf DLRM score 665,646: Inference Data Center v0.7 ID 0.7-115; Hardware used (Supermicro 6049GP-TRT-OTO-29 (20x T4, TensorRT)); INT8; https://mlcommons.org/en/inference-datacenter-07/
• c. Submitter: NVIDIA; A10 MLPerf DLRM score 772,378: Inference Data Center v1.0 ID 1.0-54; Hardware used (Supermicro 4029GP-TRT-OTO-28 (8x A10, TensorRT)); INT8; https://mlcommons.org/en/inference-datacenter-10/
• d. Internal estimates by Esperanto for MLPerf DLRM: Inference Data Center v0.7; ET-SOC-1; Unverified result is from Emulated/Simulated pre-silicon projections; INT8; Result not verified by MLCommons™ Association.
[10] Measured by ResNet-50 Images per second (Esperanto INT8 Batch 8, NVIDIA INT8 Batch 8, Habana INT8 Batch 10, Intel INT8 Batch 11)
Additional measurement source information:
• a.1. Intel ResNet-50: https://software.intel.com/content/www/us/en/develop/articles/intel-cpu-outperforms-nvidia-gpu-on-resnet-50-deep-learning-inference.html
• a.2. Intel 9282 has 2 die in the package, CPU TDP power for both die from: https://ark.intel.com/content/www/us/en/ark/products/194146/intel-xeon-platinum-9282-processor-77m-cache-2-60-ghz.html
• b. NVIDIA (T4, A10) ResNet-50: https://developer.nvidia.com/deep-learning-performance-training-inference
• c. Habana ResNet-50: https://habana.ai/wp-content/uploads/2018/09/Goya-Datasheet-HL-10x-Nov14-2018.pdf
• d. Esperanto ResNet-50: Emulated/Simulated projections; INT8
[11] P. Xekalakis and C. Celio, The Esperanto ET-Maxion High Performance Out-of-Order RISC-V Processor, 2018 RISC-V Summit, presentation at https://www.youtube.com/watch?v=NjEslX_-t0Q
[12] Maxion is described in “Esperanto Maxes out RISC-V - High-End Maxion CPU Raises RISC-V Performance Bar”, Microprocessor Report, December 10, 2018.
23