2024 - Intel - Tech Tour TW - Lunar Lake AI Hardware Accelerators
AI Hardware Accelerators
TAP, Intel Fellow
AI PC Landscape
AI client workload trends: growing in diversity, from background blurring to gen AI; multi-modal; transformers and diffusion
Example workloads: pose estimation, background blurring, dynamic noise suppression, meeting summarization, chatbots, voice recognition, AI assistants, text-to-video, depth estimation, style transfer
AI Engine Adoption (macro ISV trend)
[Chart: NPU and GPU adoption shares of roughly ~25%, ~30%, and ~40%; multi-engine adoption growing across NPU and GPU]
Up to 120 platform TOPS
GPU: creator & gamer AI
NPU: AI assistants & gen AI
CPU: light "embedded" AI
Lunar Lake GPU AI Engine
Xe GPU architecture; XMX (Xe Matrix Extensions); 67 peak TOPS
All TOPS are int8 on the high-end SKU and will vary based on SKU.
Lunar Lake CPU AI Engine
P-core & E-core architecture; VNNI & AVX AI instructions; 5 peak CPU TOPS
Lunar Lake NPU Deep Dive
Darren Crews, Sr. Principal Engineer, NPU Lead Architect
Lunar Lake NPU AI Engine
NPU 4 architecture; 2x peak power efficiency; 48 peak TOPS
Continuous NPU Improvements
Across 4 generations of IP; proven foundations based on three prior generations:
NPU 4: 48 pTOPs
NPU 3: 7 pTOPs (2023)
NPU 2: 0.5 pTOPs (2021)
NPU 1 (2018)
From NPU 3 to NPU 4: increase frequency, improve architecture.
What is a TOP?
Trillions of Operations per Second: 1 multiply & 1 accumulate (MAC) = 2 operations; "trillions" = 10^12.

How many AI TOPS in Meteor Lake's NPU?
2 ops/MAC * 4096 MAC/cycle * 1.4 GHz / 10^12 = 11.5 TOPS
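That arithmetic (and the Lunar Lake figure given later in the appendix) can be sanity-checked in a few lines of Python. This is a minimal sketch; the helper name is mine, and the MAC/frequency figures come from the deck:

```python
# Peak TOPS from the slide's formula: ops/MAC * MACs/cycle * frequency.
def peak_tops(macs_per_cycle: int, freq_ghz: float, ops_per_mac: int = 2) -> float:
    """Trillions of int8 operations per second (1 MAC = 2 ops)."""
    ops_per_second = ops_per_mac * macs_per_cycle * freq_ghz * 1e9
    return ops_per_second / 1e12

# Meteor Lake NPU: 4096 MAC/cycle total at 1.4 GHz
print(round(peak_tops(4096, 1.4), 1))    # 11.5

# Lunar Lake NPU: 6 tiles * 2048 MAC/cycle = 12288 MAC/cycle at 1.95 GHz
print(round(peak_tops(12288, 1.95)))     # 48
```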
Operation Types
Overview            Scalar    Vector       Matrix
Complexity          1         N            N^2
Occurrence in AI    Low       Very high    Very high
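The complexity row can be made concrete by counting multiply-accumulates: a scalar op is 1 MAC, a length-n dot product is n MACs, and a naive n x n matrix-vector product is n^2 MACs. A small sketch (helper name is mine):

```python
def matvec(A, x):
    """Naive matrix-vector product; returns (result, mac_count)."""
    macs = 0
    y = []
    for row in A:
        acc = 0
        for a, b in zip(row, x):
            acc += a * b          # one multiply-accumulate
            macs += 1
        y.append(acc)
    return y, macs

y, macs = matvec([[1, 2], [3, 4]], [10, 1])
print(y, macs)   # [12, 34] 4  (n = 2, so n^2 = 4 MACs)
```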
Scaling the NPU
NPU 3 to NPU 4: from 4K MACs / 2 NCEs to 12K MACs / 6 NCEs
Increased efficiency & increased performance:
4x peak performance (increased clock, architecture improvements)
2x perf at ISO power1 (new node)
1 Based on pre-production simulation data of a real network. See backup for details.
Scaling the NPU
NPU 3 to NPU 4: improve architecture
NPU 4 architecture improvements: NPU bandwidth, efficiency of matrix compute
Specialized engines: matrix + vector
Inference pipeline: MAC arrays + fixed function
Programmable DSPs
NPU 4 Inference Pipeline
Efficient matrix multiplication with activation function support
Matrix multiplication & convolution: 16x16x4 MAC array; 2048 MAC/cycle int8; 1024 MAC/cycle FP16
Up to 2x1 efficiency, driving better perf/watt
Multiple functions supported; FP precision support
NPU 4 Data Conversion
Datatype conversion, fused operations, output data re-layout
NPU 4 SHAVE DSP
Upgraded SHAVE DSP: 4x vector compute, 12x overall vector perf; improves transformer/LLM performance
512-bit vector register file size; 4x performance; 4x bandwidth to and from the vector unit
SHAVE DSP: NPU 3 vs NPU 4
NPU 3 SHAVE DSP: 128-bit VRF, 8 FP16 ops/clock through the VAU
NPU 4 SHAVE DSP: 512-bit VRF, 4x vector increase, 32 FP16 vector ops/clock per DSP
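The 8 vs 32 ops/clock figures fall straight out of the register widths, and the appendix's 12x overall claim is the width gain times the 3x tile count. A minimal sketch (assuming 16-bit FP16 elements):

```python
# FP16 lanes per clock = vector width in bits / 16 bits per element.
def fp16_lanes(vector_bits: int) -> int:
    return vector_bits // 16

npu3_lanes = fp16_lanes(128)          # NPU 3: 8 FP16 ops/clock per DSP
npu4_lanes = fp16_lanes(512)          # NPU 4: 32 FP16 ops/clock per DSP

width_gain = npu4_lanes // npu3_lanes # 4x vector compute
overall_gain = 3 * width_gain         # 3x tiles * 4x width = 12x overall
print(npu3_lanes, npu4_lanes, width_gain, overall_gain)   # 8 32 4 12
```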
NPU 4 DMA Engine
2x DMA bandwidth improves network performance, especially for LLMs
New functions: embedding tokenization
NPU 4 Performance
4x TOPS, 2x bandwidth
Use Cases
[Diagram: transformer use cases across modalities (image, speech, sound, brainwaves) spanning generation and classification]
Transformer Model Architecture
Use cases: classification (encoder only), translation (encoder + decoder), generation (decoder only)
Encoder: input embedding + positional encoding, then multi-head attention, add & norm, feed forward, add & norm
Decoder: output embedding (outputs shifted right) + positional encoding, then masked multi-head attention, add & norm, multi-head attention, add & norm, feed forward, add & norm, linear, softmax
Transformer Architecture on Intel's NPU
[Diagram: the encoder/decoder blocks above, showing how each stage maps onto the NPU]
Multi-Head Attention Flowchart
Q, K, and V come from separate Linear projections of the input.
1. MatMul (QK^T): word-vector similarity via dot product (e.g., how similar are "chips" and "laptop"?)
2. SoftMax: similarity normalization into attention weights
3. MatMul with V: attention score calculation

Attention(Q, K, V) = SoftMax(QK^T / sqrt(d_k)) V
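The flowchart maps directly onto a few lines of NumPy. A minimal sketch of scaled dot-product attention (shapes and seed are illustrative, not from the deck):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: SoftMax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # MatMul: dot-product similarity
    weights = softmax(scores, axis=-1)  # SoftMax: rows sum to 1
    return weights @ V                  # MatMul: weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 query tokens, d_k = 4
K = rng.standard_normal((5, 4))   # 5 key tokens
V = rng.standard_normal((5, 4))
out = attention(Q, K, V)
print(out.shape)   # (3, 4)
```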
Stable Diffusion Architecture
"Cute kitten with a pink bow" -> text encoder -> U-Net+ -> U-Net- -> VAE -> output
U-Net diffusion: text prompt understanding; attention layers; transpose convolutions; concat; switch; image decoder
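The separate U-Net+ and U-Net- boxes are consistent with classifier-free guidance: one conditional and one unconditional U-Net pass per denoising step. A toy sketch of that interpretation follows; all names and numbers are illustrative except the 20-step / 42-inference count, which matches the demo slide:

```python
import numpy as np

def guided_noise(noise_cond, noise_uncond, guidance_scale=7.5):
    """Classifier-free guidance: push the conditional prediction away
    from the unconditional one (my reading of the U-Net+/U-Net- split)."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Toy loop shape: 20 steps x 2 U-Net passes = 40 inferences, plus one
# text-encoder call and one VAE-decoder call = 42 total, as in the demo.
steps = 20
latent = np.zeros((4, 8, 8))          # toy latent, not SD's real size
unet_calls = 0
for _ in range(steps):
    cond = np.random.default_rng(0).standard_normal(latent.shape)
    uncond = np.random.default_rng(1).standard_normal(latent.shape)
    unet_calls += 2                   # U-Net+ (cond) and U-Net- (uncond)
    latent = latent - 0.1 * guided_noise(cond, uncond)
print(unet_calls + 2)   # 42
```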
Accelerating Multi-Head Attention
Performance on U-Net (timeline 0 ms to 350 ms): the QK^T MatMul, SoftMax, and (QK^T)V MatMul stages of attention are scheduled across the MAC arrays and SHAVE DSPs, giving up to 9x faster attention calculation.
Stable Diffusion Demo
"Cute kitten with a pink bow" -> text encoder -> U-Net+ -> U-Net- -> VAE -> output
42 inferences: Text Encoder (1) + U-Net+ (20) + U-Net- (20) + VAE Decoder (1)
Meteor Lake: 11.2 W, 2.9x relative runtime
Lunar Lake: 9.0 W, 5.8 s (1x)
48 TOPS: largest integrated and dedicated AI accelerator, optimized for the AI PC
Up to 2x efficiency & bandwidth; enhanced MAC array
12 enhanced SHAVE DSPs; embedding tokenization used for LLMs, accelerating LLM & transformer operations
6 neural compute engines
Thank You
Notices & Disclaimers
The preceding presentation contains product features that are currently under development. Information shown through the presentation is based on current expectations and subject to change without notice.
Results that are based on pre-production systems and components as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example new system), internal Intel
analysis or architecture simulation or modeling are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications or configurations.
Performance varies by use, configuration and other factors. Learn more at www.intel.com/PerformanceIndex.
AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at www.intel.com/AIPC.
No product or component can be absolutely secure. Intel technologies may require enabled hardware, software or service activation.
All product plans and roadmaps are subject to change without notice.
Performance hybrid architecture combines two core microarchitectures, Performance-cores (P-cores) and Efficient-cores (E-cores), on a single processor die first introduced on 12th Gen Intel® Core processors.
Select 12th Gen and newer Intel® Core processors do not have performance hybrid architecture, only P-cores or E-cores, and may have the same cache size. See ark.intel.com for SKU details, including cache
size and core frequency.
Built-in Intel® Arc GPU only available on select Intel® Core Ultra processor-powered systems; OEM enablement required.
Some images may have been altered or simulated and are for illustrative purposes only.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
APPENDIX
Claim # & Statement | Slide # & Title/Details

2x performance at ISO power vs. Meteor Lake | Testing by Intel as of January 2024. Based on VPU-EM simulation. Power data is generated from the simulation tool based on power data that has been extracted from circuit simulation tools. This simulation, which is a ~100% utilization int8 network, is expected to correlate well with silicon.

4x peak performance | Based on TOPS increase from MTL (11 TOPS) to LNL (48 TOPS).

4x vector compute | Based on 4x vector width increase vs. NPU 3. NPU 3 has 8 FP16 vector ops/clock; NPU 4 has 32.

12x overall vector performance | Vector performance = 3x tiles * 4x vector width (vs. NPU 3).

4x TOPS | TOPS = # of tiles * fmax frequency * ops/clock. Meteor Lake is up to 11.5 TOPS; Lunar Lake is up to 48 TOPS. Meteor Lake TOPS = (2 tiles * 1.4 GHz * 4096 ops/clock)/1000; Lunar Lake TOPS = (6 tiles * 1.95 GHz * 4096 ops/clock)/1000.

2x IP bandwidth | IP bandwidth: Meteor Lake is 64 GB/s; Lunar Lake is 136 GB/s.

Lunar Lake vs. Meteor Lake performance, power and efficiency ratio | Testing by Intel as of May 2024. Data based on Lunar Lake reference validation platform vs. Intel® Core Ultra 7 155H with 32GB LPDDR5-6400MHz (Meteor Lake). Calculated using open source GIMP with NPU plug-in. Text Encoder & U-Net +/- are running on the NPU; VAE is running on the built-in GPU.