2024 - Intel - Tech Tour TW - Lunar Lake AI Hardware Accelerators

The document discusses the advancements and trends in AI hardware, particularly focusing on Intel's Lunar Lake architecture and its multi-engine approach, which includes GPU, CPU, and NPU components. It highlights the increasing adoption of AI capabilities in PCs, the performance improvements across different generations of NPUs, and the architecture's support for various AI workloads such as text-to-video and voice recognition. Additionally, it covers the technical specifications and operational capabilities of the NPU, emphasizing its efficiency and performance enhancements for AI applications.


Lunar Lake
AI Hardware Accelerators

TAP
Intel Fellow

AI PC Landscape

AI Client Workload Trends (2024 -> 2025)

Growing in diversity: from background blurring to gen AI
App & OS integration: from features in apps to OS co-pilots
Multi-modal: transformers and diffusion

Example workloads: pose estimation, background blurring, dynamic noise suppression, chatbots, meeting summarization, voice recognition, AI assistants, text-to-video, depth estimation, style transfer, recommendations, denoising, code debug, identification, text-to-music, code generation, text-to-image, realtime filters, contextual search, image upscaling
AI PC Trends

AI engine adoption in ISV feature plans through '25:

           2024    2025
  NPU      ~25%    ~30%
  GPU      ~40%    ~40%
  CPU      ~35%    ~30%

Multi-engine adoption is the macro ISV trend.
The GPU's role remains significant in ISV feature plans through '25.
Multiple performant engines are the best fit for enabling ISV efforts.

Based on internal Intel research as of May 2024.


Unmatched AI Compute
With our multi-engine approach

Up to 120 platform TOPS:

  GPU: creator & gamer AI
  NPU: AI assistants & gen AI
  CPU: light "embedded" AI

Lunar Lake GPU AI Engine

  GPU architecture:   Xe
  AI instructions:    XMX (Xe Matrix Extensions)
  Peak TOPS:          67

All TOPS are INT8 on the high-end SKU and will vary based on SKU.

Lunar Lake CPU AI Engine

  CPU architecture:   P-core & E-core
  AI instructions:    VNNI & AVX
  Peak TOPS:          5

All TOPS are INT8 on the high-end SKU and will vary based on SKU.

Lunar Lake NPU Deep Dive

Darren Crews
Sr. Principal Engineer, NPU Lead Architect

Lunar Lake NPU AI Engine

  Architecture:       NPU 4
  Power efficiency:   2x
  Peak TOPS:          48

All TOPS are INT8 on the high-end SKU and will vary based on SKU.

Continuous NPU Improvements
Across 4 generations of IP

  NPU 1 (2018): 0.5 pTOPS
  NPU 2 (2021): 7 pTOPS
  NPU 3 (2023): 11.5 pTOPS
  NPU 4 (2024): 48 pTOPS

Proven foundations, based on three prior generations.
Higher compute capacity, to support a growing number of use cases.
Increased efficiency, to support longer battery life.

All TOPS are INT8 on the high-end SKU and will vary based on SKU.

Scaling the NPU (NPU 3 -> NPU 4)

  Increase the number of engines
  Increase frequency
  Improve the architecture

What is a TOP?
Trillions of Operations per Second

One multiply-accumulate (MAC) counts as 2 operations: 1 multiply and 1 accumulate.

  TOPS = (2 ops/MAC) x (MACs/cycle) x clock frequency (Hz) / 10^12

How many AI TOPS in Meteor Lake's NPU?

  2 ops/MAC x 4096 MAC/cycle x 1.4 GHz / 10^12 = 11.5 TOPS
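
A quick sanity check of the formula above, using the tile counts and clock speeds given in the appendix ("4x TOPS" claim). The 4096 MAC/cycle on the slide is the whole Meteor Lake NPU, i.e. 2 tiles at 2048 MACs/cycle each:

```python
# Peak TOPS = tiles * MACs/cycle per tile * 2 ops/MAC * frequency / 1e12.
# Tile counts and clocks are Intel's appendix figures; the helper is illustrative.
def peak_tops(tiles: int, macs_per_tile_per_cycle: int, freq_ghz: float) -> float:
    return tiles * macs_per_tile_per_cycle * 2 * freq_ghz * 1e9 / 1e12

print(peak_tops(2, 2048, 1.4))   # Meteor Lake (NPU 3): ~11.5 TOPS
print(peak_tops(6, 2048, 1.95))  # Lunar Lake (NPU 4): ~48 TOPS
```
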
Operation Types Overview

  Type     Complexity   Example functions                     Occurrence in AI
  Scalar   1            Conditionals, looping                 Low
  Vector   N            SoftMax, activation functions         Very high
  Matrix   N^2          Convolution, matrix multiplication    Very high
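
To make the complexity column concrete, a small NumPy sketch (sizes are arbitrary): an activation such as SoftMax touches each of the n elements once, while a matrix multiply does n MACs per output element.

```python
import numpy as np

n = 1024
x = np.random.randn(n).astype(np.float32)
W = np.random.randn(n, n).astype(np.float32)

e = np.exp(x - x.max())
softmax = e / e.sum()   # vector op: work grows ~linearly with n
y = W @ x               # matrix op: ~n*n multiply-accumulates

print(f"vector op: ~{n} flops; matrix op: ~{n*n} MACs")
```
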
Scaling the NPU: increase the number of engines

  NPU 3: 4K MACs across 2 neural compute engines (NCEs)
  NPU 4: 12K MACs across 6 NCEs

Scaling the NPU: increase frequency

  NPU 3 runs at up to 1.4 GHz; NPU 4 raises this to 1.95 GHz (see appendix).

Increased Efficiency & Increased Performance

NPU 4 vs. NPU 3 (performance vs. power):

  4x peak performance
  2x performance at ISO power1
  Achieved through an increased clock, a new process node, and architecture improvements

1 Based on pre-production simulation data of a real network. See backup for details.

Scaling the NPU: improve the architecture

NPU 4 architecture improvements:

  Improved efficiency of matrix compute
  Increased NPU bandwidth
  Improved vector performance
  Increased number of tiles

NPU 4 Architecture Overview

  Global control & MMU
  DMA & scratchpad RAM
  Neural compute engines

NPU 4 Neural Compute Engine

  Specialized engines: matrix + vector
  Inference pipeline: MAC arrays + fixed-function hardware
  Programmable DSPs

NPU 4 Inference Pipeline

  Efficient matrix multiplication
  Activation function support
  Data conversion and re-layout support

NPU 4 MAC Array

Matrix multiplication & convolution:

  INT8: 16x16x8 array, 2048 MAC/cycle
  FP16: 16x16x4 array, 1024 MAC/cycle

Up to 2x1 efficiency, driving better perf/watt.

1 See details in backup.
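
Cross-checking the figures above: a 16x16x8 INT8 array is 16 * 16 * 8 = 2048 MACs per cycle, and the FP16 configuration (16x16x4) is half that. Scaled by 6 neural compute engines at 1.95 GHz (appendix numbers), the INT8 path lands at the quoted 48 TOPS.

```python
int8_macs = 16 * 16 * 8        # 2048 MAC/cycle per engine
fp16_macs = 16 * 16 * 4        # 1024 MAC/cycle per engine
tops_int8 = 6 * int8_macs * 2 * 1.95e9 / 1e12
print(int8_macs, fp16_macs, round(tops_int8, 1))  # 2048 1024 47.9
```
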


NPU 4 Activation Functions

  ReLU(x) ≜ max(0, x)

  Multiple functions supported
  FP precision supported
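
The slide names only ReLU explicitly; sigmoid below is an additional common example (an assumption), shown to illustrate what "activation function support" in the fixed-function pipeline refers to.

```python
import numpy as np

def relu(x):     # ReLU(x) = max(0, x), as defined on the slide
    return np.maximum(0.0, x)

def sigmoid(x):  # another common activation (illustrative assumption)
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(relu(x))
print(sigmoid(x))
```
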
NPU 4 Data Conversion

  Datatype conversion: the float range [min(Xf), max(Xf)] maps onto the INT8 range [0, 255]
  Fused operations
  Output data re-layout
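
A minimal sketch of the FP-to-INT8 mapping pictured above, assuming a linear (asymmetric) mapping of [min(Xf), max(Xf)] onto [0, 255]. The NPU's fixed-function hardware performs this conversion; the NumPy version only illustrates the arithmetic.

```python
import numpy as np

def quantize_uint8(xf: np.ndarray):
    scale = (xf.max() - xf.min()) / 255.0           # float units per integer step
    zero_point = -xf.min() / scale                  # where 0.0 lands in [0, 255]
    xq = np.clip(np.round(xf / scale + zero_point), 0, 255).astype(np.uint8)
    return xq, scale, zero_point

xf = np.random.randn(8).astype(np.float32)
xq, scale, zp = quantize_uint8(xf)
print(xq)                                  # values in [0, 255]
print((xq.astype(np.float32) - zp) * scale)  # approximate reconstruction of xf
```
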
NPU 4 SHAVE DSP

Upgraded SHAVE DSP:

  4x vector compute
  12x overall vector performance, improving transformer/LLM performance

See details in backup.


NPU 4 SHAVE DSP vs. NPU 3 SHAVE DSP

  512-bit vector register file (VRF), up from 128-bit
  4x performance
  4x bandwidth to and from the vector unit

  NPU 3 SHAVE DSP: 128-bit VRF and vector arithmetic unit (VAU), 8 FP16 vector ops/clock per DSP
  NPU 4 SHAVE DSP: 512-bit VRF and VAU, 32 FP16 vector ops/clock per DSP
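
The lane math behind the diagram: FP16 values are 16 bits wide, so a 128-bit register holds 8 of them and a 512-bit register holds 32. Combined with 3x more tiles (per the appendix), that yields the 12x overall vector performance claim.

```python
npu3_lanes = 128 // 16    # 8 FP16 ops/clock per DSP
npu4_lanes = 512 // 16    # 32 FP16 ops/clock per DSP
print(npu4_lanes // npu3_lanes)        # 4x vector compute
print((npu4_lanes // npu3_lanes) * 3)  # 12x with 3x tiles
```
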
NPU 4 DMA Engine

  2x DMA bandwidth, improving network performance, especially for LLMs
  New functions: embedding tokenization

See backup for details.


NPU 4 Performance
Intel NPU 4 vs. Intel NPU 3:

  12x vector performance
  4x TOPS
  2x IP bandwidth

See backup for details.


Transformer Use Cases

  Inputs:  text, image, video, speech, sound, brainwaves
  Tasks:   translation, generation, classification
  Outputs: text, image, video, speech, sound, brainwaves

Transformer Model Architecture

  Translation:     encoder + decoder
  Generation:      decoder
  Classification:  encoder

Encoder: input embedding + positional encoding -> multi-head attention -> add & norm -> feed forward -> add & norm
Decoder: output embedding (shifted right) + positional encoding -> masked multi-head attention -> add & norm -> multi-head attention -> add & norm -> feed forward -> add & norm -> linear -> SoftMax

Transformer Architecture on Intel's NPU

The same encoder and decoder building blocks (embedding, multi-head attention, add & norm, feed forward, linear, SoftMax) are what the NPU accelerates.

Multi-Head Attention Flowchart

  Attention(Q, K, V) = SoftMax(QK^T / sqrt(d_k)) V

Per head, Q, K, and V each pass through a linear projection, the scaled dot-product attention below is computed, and the heads are concatenated and passed through a final linear layer.

  1. Word-vector similarity (dot product): a MatMul of Q and K^T compares embeddings, e.g. "chips" (7.4, 9) vs. "laptop" (9, 9).
  2. Similarity normalization: SoftMax turns the raw scores (e.g. 0.25 and 1.0) into weights (0.2 and 0.8).
  3. Attention score calculation: a MatMul applies those weights to V.
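
A NumPy rendering of the formula above. The three numbered steps map directly onto the two MatMuls and the SoftMax stage shown on the later "Accelerating Multi-Head Attention" slides; shapes here are illustrative.

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # step 1: similarity dot-products
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # step 2: SoftMax normalization
    return weights @ V                               # step 3: attention-weighted values

Q = K = V = np.random.randn(4, 8)                    # 4 tokens, d_k = 8
print(attention(Q, K, V).shape)                      # (4, 8)
```
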
Stable Diffusion Architecture

  Text prompt understanding -> U-Net diffusion -> image decoder
  "Cute kitten with a pink bow" -> text encoder -> U-Net+ / U-Net- -> VAE -> output

The U-Net diffusion loop denoises over N steps (Step 1, Step 2, ... Step N). Each step combines transpose convolutions, concatenations, and attention layers, with a switch selecting between paths.
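
A minimal way to reproduce the pictured pipeline is the Hugging Face diffusers library (an assumption; Intel's own demo used GIMP with an NPU plug-in, per the appendix). With 20 steps and guidance, the U-Net runs twice per step (the U-Net+ / U-Net- passes), i.e. 40 U-Net inferences plus one text-encoder and one VAE-decoder call, matching the demo slide's 42 inferences.

```python
from diffusers import StableDiffusionPipeline

# Model ID is illustrative; any Stable Diffusion v1.5 checkpoint works.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("Cute kitten with a pink bow", num_inference_steps=20).images[0]
image.save("kitten.png")
```
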
Accelerating Multi-Head Attention
Performance on U-Net

Execution trace (0 ms to 350 ms) across the compute engines' MAC arrays and SHAVE DSPs: up to 9x* faster attention calculation.

Within each attention block, the QK^T MatMul and the (QK)V MatMul run on the MAC arrays, while the SoftMax stages run on the SHAVE DSPs, so matrix and vector work overlap across the engines.

*See details in backup.

Stable Diffusion Demo

Pipeline: "Cute kitten with a pink bow" -> text prompt understanding (text encoder) -> U-Net diffusion (U-Net+ / U-Net-) -> image decoder (VAE) -> output

Workload: Stable Diffusion v1.5, 20 iterations, 42 inferences:
text encoder (1) + U-Net+ (20) + U-Net- (20) + VAE decoder (1)

  Stage          Meteor Lake   Lunar Lake   Data type
  Text encoder   CPU           NPU          FP16
  U-Net          NPU           NPU          INT8
  VAE decoder    GPU           GPU          FP16

  Metric                                     Meteor Lake   Lunar Lake
  Image generation time (lower is better)    20.9 s        5.8 s
  Avg SoC package power (lower is better)    11.2 W        9.0 W
  Efficiency ratio (higher is better)        1x            2.9x

See backup for details. Results may vary.


Next Gen: NPU 4

The largest integrated and dedicated AI accelerator, optimized for the AI PC.

  48 TOPS
  6 neural compute engines
  12 enhanced SHAVE DSPs, accelerating LLM & transformer operations
  Up to 2x efficiency-optimized MAC array
  2x bandwidth
  Native activation function & data conversion support
  Embedding tokenization used for LLMs

Thank You

Notices & Disclaimers
The preceding presentation contains product features that are currently under development. Information shown through the presentation is based on current expectations and subject to change without notice.
Results that are based on pre-production systems and components as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example new system), internal Intel
analysis or architecture simulation or modeling are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications or configurations.
Performance varies by use, configuration and other factors. Learn more at www.intel.com/PerformanceIndex.
AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at www.intel.com/AIPC.
No product or component can be absolutely secure. Intel technologies may require enabled hardware, software or service activation.
All product plans and roadmaps are subject to change without notice.
Performance hybrid architecture combines two core microarchitectures, Performance-cores (P-cores) and Efficient-cores (E-cores), on a single processor die first introduced on 12th Gen Intel® Core processors.
Select 12th Gen and newer Intel® Core processors do not have performance hybrid architecture, only P-cores or E-cores, and may have the same cache size. See ark.intel.com for SKU details, including cache
size and core frequency.
Built-in Intel® Arc GPU only available on select Intel® Core Ultra processor-powered systems; OEM enablement required.

Some images may have been altered or simulated and are for illustrative purposes only.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
APPENDIX
Claims and supporting details by slide:

SLIDE 22: Increased Efficiency & Increased Performance

  2x performance at ISO power vs. Meteor Lake: Testing by Intel as of January 2024. Based on VPU-EM simulation. Power data is generated from the simulation tool based on power data extracted from circuit simulation tools. This simulation, a ~100% utilization INT8 network, is expected to correlate well with silicon.

  4x peak performance: Based on the TOPS increase from MTL (11 TOPS) to LNL (48 TOPS).

SLIDE 34: NPU 4 SHAVE DSP

  4x vector compute: Based on a 4x vector width increase vs. NPU 3. NPU 3 has 8 FP16 vector ops/clock; NPU 4 has 32.

  12x overall vector performance: Vector performance = 3x tiles x 4x vector width (vs. NPU 3).

SLIDE 38: NPU 4 Performance

  12x vector performance: Vector performance = 3x tiles x 4x vector width (vs. NPU 3).

  4x TOPS: TOPS = number of tiles x fmax x ops/clock. Meteor Lake is up to 11.5 TOPS; Lunar Lake is up to 48 TOPS. Meteor Lake TOPS = (2 tiles x 1.4 GHz x 4096 ops/clock) / 1000. Lunar Lake TOPS = (6 tiles x 1.95 GHz x 4096 ops/clock) / 1000.

  2x IP bandwidth: Meteor Lake is 64 GB/s; Lunar Lake is 136 GB/s.

SLIDE 55: Stable Diffusion v1.5

  Lunar Lake vs. Meteor Lake performance, power and efficiency ratio: Testing by Intel as of May 2024. Data based on a Lunar Lake reference validation platform vs. an Intel Core Ultra 7 155H with 32 GB LPDDR5-6400 (Meteor Lake). Calculated using open-source GIMP with the NPU plug-in. The text encoder and U-Net+/- run on the NPU; the VAE runs on the built-in GPU.