2023-05 On Device AI - Double-Edged Sword
Scaling limits, model size constraints, why server-based AI wins, and future hardware improvements
DYLAN PATEL AND SOPHIA WISDOM
13 MAY 2023 ∙ PAID
The most discussed part of the AI industry is chasing ever-larger language models
that can only be developed by mega-tech companies. While training these models is
costly, deploying them is even more difficult in some regards. In fact, OpenAI’s GPT-
4 is so large and compute-intensive that just running inference requires multiple
~$250,000 servers, each with 8 GPUs, loads of memory, and a ton of high-speed
networking. Google takes a similar approach with its full-size PaLM model, which
requires 64 TPUs and 16 CPUs to run. Meta’s largest recommendation models in
2021 required 128 GPUs to serve users. Ever more powerful models will continue to proliferate, especially with AI-focused clouds and ML Ops firms such as MosaicML assisting enterprises with developing and deploying LLMs…but bigger is not always better.
There is a whole different universe of the AI industry that seeks to reject big iron
compute. The open-source movement around small models that can run on client devices is probably the 2nd most discussed part of the industry. While models on the scale of GPT-4 or the full PaLM could never hope to run on laptops and smartphones due to the memory wall, even with 5 more years of hardware advancement, there is a thriving ecosystem of model development geared towards on-device inference.
Today we are going to discuss these smaller models on client-side devices such as
laptops and phones. This discussion will focus on the gating factors for inference
performance, the fundamental limits to model sizes, and how hardware development
going forward will establish the boundaries of development here.
A prime example is that of Siri, Alexa, etc., being quite horrible as personal assistants. Large language models paired with natural voice synthesis could unlock far more human and intelligent AI assistants that help manage your life. From creating calendar events to summarizing conversations to search, there will be a personal assistant on every device based on a multi-modal language model.
These models are already far more capable than Siri, Google Assistant, Alexa, Bixby, etc., and we are still in the very early innings.
In some ways, generative AI is rapidly becoming a bimodal distribution: massive foundational models and much smaller models that can run on client devices garner the majority of investment, with a great chasm in between.
The highest-end client mobile devices will come with ~50 billion transistors and
more than enough TFLOP/s for on-device AI due to architectural innovations that
are in the pipeline at firms like Intel, AMD, Apple, Google, Samsung, Qualcomm,
and MediaTek. To be clear, none of their existing client AI accelerators are well-
suited for transformers, but that will change in a few years. These advancements in
the digital logic side of chips will solve the computing problem, but they cannot
tackle the true underlying problem of the memory wall and data reuse.
GPT-style models are trained to predict the next token (~= word) given the previous tokens. To generate text, you feed the model the prompt, ask it to predict the next token, append that generated token to the prompt, ask it to predict the next token again, and keep going. In order to do this, you have to send all the parameters from RAM to the processor every time you predict the next token. The first problem is that you have to store all these parameters as close as possible to the compute. The other problem is that you have to be able to load these parameters from memory into the compute exactly when you need them.
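As a rough illustration (not any particular library's API; `model`, `tokenizer`, and `forward` are placeholder names), the decode loop looks like this, and every pass through it touches essentially all of the model's weights:

```python
# Illustrative sketch of autoregressive decoding. `model` and `tokenizer`
# are placeholders, not a specific library's API.
def generate(model, tokenizer, prompt, max_new_tokens=128):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # Each forward pass reads essentially every parameter from memory,
        # so the full set of weights is streamed once per generated token.
        logits = model.forward(tokens)
        next_token = int(logits[-1].argmax())  # greedy: take the most likely token
        tokens.append(next_token)              # feed it back in and repeat
    return tokenizer.decode(tokens)
```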
A 7 billion parameter model at FP16 needs 14GB just to hold its weights. Furthermore, this 14GB ignores other applications, the operating system, and other overhead related to activations/KV cache. This puts an immediate limit on the size of the model that a developer can use to deploy on-device AI, even if they can assume the client endpoint has the computational oomph required. Storing 14GB of
parameters on a client-side processor is physically impossible. The most common type of on-chip memory is SRAM, which even on TSMC 3nm is only ~0.6GB per 100mm^2.
For reference, that is about the same size chip as the upcoming iPhone 15 Pro’s A17
and ~25% smaller than the upcoming M3. Furthermore, this figure is without
overhead from assist circuitry, array inefficiency, NoCs, etc. Large amounts of local
SRAM will not work for client inference. Emerging memories such as FeRAM and
MRAM do bring some hope for a light at the end of the tunnel, but they are quite far
away from being productized on the scale of Gigabytes.
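As a back-of-the-envelope check on why on-chip SRAM is a non-starter, using the ~0.6GB per 100mm^2 figure above (a sketch; the reticle-limit figure is approximate):

```python
# Rough area needed to hold a 7B-parameter FP16 model entirely in 3nm SRAM.
params = 7e9                  # 7 billion parameters
bytes_per_param = 2           # FP16
weights_gb = params * bytes_per_param / 1e9           # = 14 GB of weights
sram_gb_per_100mm2 = 0.6      # ~0.6 GB of SRAM per 100 mm^2 on TSMC 3nm
area_mm2 = weights_gb / sram_gb_per_100mm2 * 100      # ≈ 2,333 mm^2

print(f"{weights_gb:.0f} GB of weights -> ~{area_mm2:,.0f} mm^2 of SRAM")
# For context, a reticle-limited die is roughly 850 mm^2, and the A17/M3-class
# dies mentioned above are on the order of 100-140 mm^2.
```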
The next layer down the hierarchy is DRAM. The highest-end iPhone, the 14 Pro
Max, has 6GB of RAM, but the modal iPhone has 3GB of RAM. While high-end PCs will have 16GB+, the majority of new sales are at 8GB of RAM. The typical client device cannot even run a 7 billion parameter model at FP16!
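A quick sketch of how many parameters fit in a given RAM budget, before subtracting the OS, other apps, and the activations/KV cache:

```python
# Upper bound on model size that fits in a given RAM budget, ignoring the OS,
# other applications, and the activations/KV cache (all of which reduce it further).
def max_params_billions(ram_gb, bits_per_param):
    return ram_gb * 1e9 / (bits_per_param / 8) / 1e9

for ram_gb in (3, 6, 8, 16):            # modal iPhone, 14 Pro Max, typical PC, high-end PC
    for bits in (16, 8, 4):             # FP16, INT8, 4-bit quantization
        print(f"{ram_gb:>2} GB RAM @ {bits:>2}-bit: "
              f"~{max_params_billions(ram_gb, bits):.1f}B parameters")
```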
This then brings up the question: why can’t we go down another layer in the hierarchy? Can we run these models off of NAND-based SSDs instead of out of RAM?
Unfortunately, this is far too slow. A 7 billion parameter model at FP16 requires
14GB/s of IO just to stream the weights in to produce 1 token (~4 characters)! The
fastest PC storage drives are at best 6GB/s, but most phones and PCs come in south
of 1GB/s. At 1GB/s, at 4-bit quantization, the biggest model that can be run would
still only be in the range of ~2 billion parameters, and that’s with pegging the SSD at
max for only 1 application with no regard for any other use-case.
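The same arithmetic works for bandwidth: at batch size 1, the storage or memory bandwidth divided by the size of the weights is a hard ceiling on tokens per second, before any activation, KV-cache, or other traffic. A hedged sketch:

```python
# Ceiling on tokens/second when streaming weights at batch size 1:
# every token requires reading every weight once, so
#   tokens/s <= bandwidth / weight_bytes
def max_tokens_per_second(params_billions, bits_per_param, bandwidth_gb_s):
    weight_gb = params_billions * bits_per_param / 8
    return bandwidth_gb_s / weight_gb

print(max_tokens_per_second(7, 16, 6))   # 7B FP16 on a 6 GB/s SSD: ~0.43 tokens/s
print(max_tokens_per_second(2, 4, 1))    # 2B 4-bit on a 1 GB/s drive: ~1 token/s
```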
This is a fundamental limit on how large and powerful local AI can get. Perhaps a
company like Apple can use that to upsell newer, more expensive phones with more
advanced AI, but that is still a while away. With the same assumptions as above, on
PC, Intel’s top-of-the-line 13th generation i9 CPUs and Apple’s M2 are capped out at
about ~3 to ~4 billion parameters.
In general, these are just best-case figures for consumer devices. To repeat, for simplicity’s sake we ignored multiple factors, including the fact that theoretical IO speed is never achieved in practice, as well as the activations/KV cache. Those only push bandwidth requirements up further and constrain model size down even more. We will talk more below about innovative hardware platforms coming next year that can help reshape the landscape, but the memory wall limits the majority of current and future
devices.
Our data shows that HBM memory is nearly half the manufacturing costs of a server-
class AI chip like the H100 or TPUv5. While client-side compute does get to use
significantly cheaper DDR and LPDDR memory (~4x per GB), that memory cost
cannot be amortized over multiple concurrent inferences. Batch size cannot be pumped to infinity because that introduces another difficult problem: every request in the batch has to wait for every other request’s token to be processed before its own result can be appended and generation of the next token can begin.
This is solved by splitting the model across many chips. The above chart is the
latency for generating 20 tokens. Conveniently, the PaLM model hits the 6.67 tokens
per second, or a ~200 words per minute minimum viable target, with 64 chips
running inference at a batch size of 256. This means each time a parameter is loaded, it is utilized for 256 different inferences.
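A small sketch of that amortization (assuming 16-bit weights for illustration and ignoring KV-cache traffic): the whole model is read once per decode step, and that read is shared by every sequence in the batch.

```python
# Weight traffic charged to each sequence per generated token.
# The model is read from memory once per decode step and shared across the batch;
# KV-cache traffic is ignored and 16-bit weights are assumed for illustration.
def weight_gb_per_token_per_sequence(params_billions, bits_per_param, batch_size):
    weight_gb = params_billions * bits_per_param / 8
    return weight_gb / batch_size

for batch in (1, 8, 64, 256):
    gb = weight_gb_per_token_per_sequence(540, 16, batch)   # full PaLM is ~540B params
    print(f"batch {batch:>3}: ~{gb:,.1f} GB of weight reads per token per sequence")
```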
FLOPS utilization improves as batch size increases because the memory wall is being
mitigated. Latency can only be brought to a reasonable point by splitting the work
across more chips. With that said, even then, only 40% of FLOPS are even used.
Google demonstrated 76% FLOPS utilization with a latency of 85.2 seconds for PaLM
inference, so the memory wall is still clearly a massive factor.
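A roofline-style sketch makes that trend concrete: a decode step takes whichever is longer, streaming the weights or doing the math, and only the math term grows with batch size. The chip numbers below are purely illustrative placeholders, and KV-cache and interconnect costs are ignored.

```python
# Roofline-style estimate of FLOPS utilization for a single decode step.
# t_step = max(time to stream weights, time to do the math);
# utilization = compute time / t_step.
def utilization(weight_bytes, batch_size, peak_flops, mem_bw_bytes_s):
    params = weight_bytes / 2                      # assuming 16-bit weights
    flops = 2 * params * batch_size                # ~2 FLOPs per weight per sequence
    t_mem = weight_bytes / mem_bw_bytes_s          # stream all weights once per step
    t_compute = flops / peak_flops
    return t_compute / max(t_mem, t_compute)

# Purely illustrative accelerator: 300 TFLOP/s of 16-bit compute, 1 TB/s of memory bandwidth.
for batch in (1, 16, 64, 256):
    u = utilization(14e9, batch, 300e12, 1e12)     # 7B-parameter FP16 model
    print(f"batch {batch:>3}: ~{u * 100:.0f}% of peak FLOPS")
```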
So server-side inference is far more efficient, but how far can local models scale?
A guest post by
Sophia Wisdom
GPU kernel engineer at magic.dev