2024 - Intel - Tech Tour TW - Lunar Lake AI Hardware Accelerators
AI Hardware Accelerators
TAP, Intel Fellow
AI PC Landscape
AI client workload trends: growing in diversity, from background blurring to gen AI; multi-modal; transformers and diffusion
Example workloads: pose estimation, background blurring, dynamic noise suppression, meeting summarization, chatbots, voice recognition, AI assistants, text-to-video, depth estimation, style transfer
AI Engine Adoption (macro ISV trend)
[Chart: NPU and GPU adoption shares of roughly ~25%, ~30%, and ~40%; multi-engine adoption growing across NPU and GPU]
Up to 120 platform TOPS
GPU: creator & gamer AI
NPU: AI assistants & gen AI
CPU: light "embedded" AI
Lunar Lake GPU AI Engine
Xe GPU architecture; XMX (Xe Matrix Extensions); 67 peak TOPS
All TOPS are int8 on the high-end SKU and will vary based on SKU.
Lunar Lake CPU AI Engine
P-core & E-core architecture; VNNI & AVX AI instructions; 5 peak CPU TOPS
Lunar Lake NPU Deep Dive
Darren Crews, Sr. Principal Engineer, NPU Lead Architect
Lunar Lake NPU AI Engine
NPU 4 architecture; 2x peak power efficiency; 48 peak TOPS
Continuous NPU Improvements
Across 4 generations of IP; proven foundations based on three prior generations:
NPU 4: 48 pTOPs
NPU 3: 7 pTOPs (2023)
NPU 2: 0.5 pTOPs (2021)
NPU 1 (2018)
From NPU 3 to NPU 4: increase frequency, improve architecture.
What is a TOP?
Trillions of Operations per Second: 1 multiply & 1 accumulate (MAC) = 2 operations; "trillions" = 10^12.

How many AI TOPS in Meteor Lake's NPU?
2 ops/MAC * 4096 MAC/cycle * 1.4 GHz / 10^12 = 11.5 TOPS
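That arithmetic (and the Lunar Lake figure given later in the appendix) can be sanity-checked in a few lines of Python. This is a minimal sketch; the helper name is mine, and the MAC/frequency figures come from the deck:

```python
# Peak TOPS from the slide's formula: ops/MAC * MACs/cycle * frequency.
def peak_tops(macs_per_cycle: int, freq_ghz: float, ops_per_mac: int = 2) -> float:
    """Trillions of int8 operations per second (1 MAC = 2 ops)."""
    ops_per_second = ops_per_mac * macs_per_cycle * freq_ghz * 1e9
    return ops_per_second / 1e12

# Meteor Lake NPU: 4096 MAC/cycle total at 1.4 GHz
print(round(peak_tops(4096, 1.4), 1))    # 11.5

# Lunar Lake NPU: 6 tiles * 2048 MAC/cycle = 12288 MAC/cycle at 1.95 GHz
print(round(peak_tops(12288, 1.95)))     # 48
```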
Operation Types
Overview            Scalar    Vector       Matrix
Complexity          1         N            N^2
Occurrence in AI    Low       Very high    Very high
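The complexity row can be made concrete by counting multiply-accumulates: a scalar op is 1 MAC, a length-n dot product is n MACs, and a naive n x n matrix-vector product is n^2 MACs. A small sketch (helper name is mine):

```python
def matvec(A, x):
    """Naive matrix-vector product; returns (result, mac_count)."""
    macs = 0
    y = []
    for row in A:
        acc = 0
        for a, b in zip(row, x):
            acc += a * b          # one multiply-accumulate
            macs += 1
        y.append(acc)
    return y, macs

y, macs = matvec([[1, 2], [3, 4]], [10, 1])
print(y, macs)   # [12, 34] 4  (n = 2, so n^2 = 4 MACs)
```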
Scaling the NPU
NPU 3 to NPU 4: from 4K MACs / 2 NCEs to 12K MACs / 6 NCEs
Increased efficiency & increased performance:
4x peak performance (increased clock, architecture improvements)
2x perf at ISO power1 (new node)
1 Based on pre-production simulation data of a real network. See backup for details.
Scaling the NPU
NPU 3 to NPU 4: improve architecture
NPU 4 architecture improvements: NPU bandwidth, efficiency of matrix compute
Specialized engines: matrix + vector
Inference pipeline: MAC arrays + fixed function
Programmable DSPs
NPU 4 Inference Pipeline
Efficient matrix multiplication with activation function support
Matrix multiplication & convolution: 16x16x4 MAC array; 2048 MAC/cycle int8; 1024 MAC/cycle FP16
Up to 2x1 efficiency, driving better perf/watt
Multiple functions supported; FP precision support
NPU 4 Data Conversion
Datatype conversion, fused operations, output data re-layout
NPU 4 SHAVE DSP
Upgraded SHAVE DSP: 4x vector compute, 12x overall vector perf; improves transformer/LLM performance
512-bit vector register file size; 4x performance; 4x bandwidth to and from the vector unit
SHAVE DSP: NPU 3 vs NPU 4
NPU 3 SHAVE DSP: 128-bit VRF, 8 FP16 ops/clock through the VAU
NPU 4 SHAVE DSP: 512-bit VRF, 4x vector increase, 32 FP16 vector ops/clock per DSP
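The 8 vs 32 ops/clock figures fall straight out of the register widths, and the appendix's 12x overall claim is the width gain times the 3x tile count. A minimal sketch (assuming 16-bit FP16 elements):

```python
# FP16 lanes per clock = vector width in bits / 16 bits per element.
def fp16_lanes(vector_bits: int) -> int:
    return vector_bits // 16

npu3_lanes = fp16_lanes(128)          # NPU 3: 8 FP16 ops/clock per DSP
npu4_lanes = fp16_lanes(512)          # NPU 4: 32 FP16 ops/clock per DSP

width_gain = npu4_lanes // npu3_lanes # 4x vector compute
overall_gain = 3 * width_gain         # 3x tiles * 4x width = 12x overall
print(npu3_lanes, npu4_lanes, width_gain, overall_gain)   # 8 32 4 12
```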
NPU 4 DMA Engine
2x DMA bandwidth improves network performance, especially for LLMs
New functions: embedding tokenization
NPU 4 Performance
4x TOPS, 2x bandwidth
Use Cases
[Diagram: transformer use cases across modalities (image, speech, sound, brainwaves) spanning generation and classification]
Transformer Model Architecture
Use cases: classification (encoder only), translation (encoder + decoder), generation (decoder only)
Encoder: input embedding + positional encoding, then multi-head attention, add & norm, feed forward, add & norm
Decoder: output embedding (outputs shifted right) + positional encoding, then masked multi-head attention, add & norm, multi-head attention, add & norm, feed forward, add & norm, linear, softmax
Transformer Architecture on Intel's NPU
[Diagram: the encoder/decoder blocks above, showing how each stage maps onto the NPU]
Multi-Head Attention Flowchart
Q, K, and V come from separate Linear projections of the input.
1. MatMul (QK^T): word-vector similarity via dot product (e.g., how similar are "chips" and "laptop"?)
2. SoftMax: similarity normalization into attention weights
3. MatMul with V: attention score calculation

Attention(Q, K, V) = SoftMax(QK^T / sqrt(d_k)) V
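The flowchart maps directly onto a few lines of NumPy. A minimal sketch of scaled dot-product attention (shapes and seed are illustrative, not from the deck):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: SoftMax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # MatMul: dot-product similarity
    weights = softmax(scores, axis=-1)  # SoftMax: rows sum to 1
    return weights @ V                  # MatMul: weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 query tokens, d_k = 4
K = rng.standard_normal((5, 4))   # 5 key tokens
V = rng.standard_normal((5, 4))
out = attention(Q, K, V)
print(out.shape)   # (3, 4)
```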
Stable Diffusion Architecture
"Cute kitten with a pink bow" -> text encoder -> U-Net+ -> U-Net- -> VAE -> output
U-Net diffusion: text prompt understanding; attention layers; transpose convolutions; concat; switch; image decoder
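The separate U-Net+ and U-Net- boxes are consistent with classifier-free guidance: one conditional and one unconditional U-Net pass per denoising step. A toy sketch of that interpretation follows; all names and numbers are illustrative except the 20-step / 42-inference count, which matches the demo slide:

```python
import numpy as np

def guided_noise(noise_cond, noise_uncond, guidance_scale=7.5):
    """Classifier-free guidance: push the conditional prediction away
    from the unconditional one (my reading of the U-Net+/U-Net- split)."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Toy loop shape: 20 steps x 2 U-Net passes = 40 inferences, plus one
# text-encoder call and one VAE-decoder call = 42 total, as in the demo.
steps = 20
latent = np.zeros((4, 8, 8))          # toy latent, not SD's real size
unet_calls = 0
for _ in range(steps):
    cond = np.random.default_rng(0).standard_normal(latent.shape)
    uncond = np.random.default_rng(1).standard_normal(latent.shape)
    unet_calls += 2                   # U-Net+ (cond) and U-Net- (uncond)
    latent = latent - 0.1 * guided_noise(cond, uncond)
print(unet_calls + 2)   # 42
```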
Accelerating Multi-Head Attention
Performance on U-Net (timeline 0 ms to 350 ms): the QK^T MatMul, SoftMax, and (QK^T)V MatMul stages of attention are scheduled across the MAC arrays and SHAVE DSPs, giving up to 9x faster attention calculation.
Stable Diffusion Demo
"Cute kitten with a pink bow" -> text encoder -> U-Net+ -> U-Net- -> VAE -> output
42 inferences: Text Encoder (1) + U-Net+ (20) + U-Net- (20) + VAE Decoder (1)
Meteor Lake: 11.2 W, 2.9x relative runtime
Lunar Lake: 9.0 W, 5.8 s (1x)
48 TOPS: largest integrated and dedicated AI accelerator, optimized for the AI PC
Up to 2x efficiency & bandwidth; enhanced MAC array
12 enhanced SHAVE DSPs; embedding tokenization used for LLMs, accelerating LLM & transformer operations
6 neural compute engines
Thank You
Notices & Disclaimers
The preceding presentation contains product features that are currently under development. Information shown through the presentation is based on current expectations and subject to change without notice.
Results that are based on pre-production systems and components as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example new system), internal Intel
analysis or architecture simulation or modeling are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications or configurations.
Performance varies by use, configuration and other factors. Learn more at www.intel.com/PerformanceIndex.
AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at www.intel.com/AIPC.
No product or component can be absolutely secure. Intel technologies may require enabled hardware, software or service activation.
All product plans and roadmaps are subject to change without notice.
Performance hybrid architecture combines two core microarchitectures, Performance-cores (P-cores) and Efficient-cores (E-cores), on a single processor die first introduced on 12th Gen Intel® Core processors.
Select 12th Gen and newer Intel® Core processors do not have performance hybrid architecture, only P-cores or E-cores, and may have the same cache size. See ark.intel.com for SKU details, including cache
size and core frequency.
Built-in Intel® Arc GPU only available on select Intel® Core Ultra processor-powered systems; OEM enablement required.
Some images may have been altered or simulated and are for illustrative purposes only.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
APPENDIX
Claim # & Statement | Slide # & Title/Details

2x performance at ISO power vs. Meteor Lake | Testing by Intel as of January 2024. Based on VPU-EM simulation. Power data is generated from the simulation tool based on power data that has been extracted from circuit simulation tools. This simulation, which is a ~100% utilization int8 network, is expected to correlate well with silicon.

4x peak performance | Based on TOPS increase from MTL (11 TOPS) to LNL (48 TOPS).

4x vector compute | Based on 4x vector width increase vs. NPU 3. NPU 3 has 8 FP16 vector ops/clock; NPU 4 has 32.

12x overall vector performance | Vector performance = 3x tiles * 4x vector width (vs. NPU 3).

4x TOPS | TOPS = # of tiles * fmax frequency * ops/clock. Meteor Lake is up to 11.5 TOPS; Lunar Lake is up to 48 TOPS. Meteor Lake TOPS = (2 tiles * 1.4 GHz * 4096 ops/clock)/1000; Lunar Lake TOPS = (6 tiles * 1.95 GHz * 4096 ops/clock)/1000.

2x IP bandwidth | IP bandwidth: Meteor Lake is 64 GB/s; Lunar Lake is 136 GB/s.

Lunar Lake vs. Meteor Lake performance, power and efficiency ratio | Testing by Intel as of May 2024. Data based on Lunar Lake reference validation platform vs. Intel® Core Ultra 7 155H with 32GB LPDDR5-6400MHz (Meteor Lake). Calculated using open source GIMP with NPU plug-in. Text Encoder & U-Net +/- are running on the NPU; VAE is running on the built-in GPU.