2023-05 On Device AI - Double-Edged Sword
Scaling limits, model size constraints, why server-based AI wins, and future hardware improvements
DYLAN PATEL AND SOPHIA WISDOM
13 MAY 2023 ∙ PAID
The most discussed part of the AI industry is chasing ever-larger language models
that can only be developed by mega-tech companies. While training these models is
costly, deploying them is even more difficult in some regards. In fact, OpenAI’s GPT-
4 is so large and compute-intensive that just running inference requires multiple
~$250,000 servers, each with 8 GPUs, loads of memory, and a ton of high-speed
networking. Google takes a similar approach with its full-size PaLM model, which
requires 64 TPUs and 16 CPUs to run. Meta’s largest recommendation models in
2021 required 128 GPUs to serve users. Ever more powerful models will continue to proliferate, especially with AI-focused clouds and ML Ops firms such as MosaicML assisting enterprises with developing and deploying LLMs…but bigger is not always better.
There is a whole different universe of the AI industry that seeks to reject big iron
compute. The open-source movement around small models that can run on client devices is probably the 2nd most discussed part of the industry. While models on the scale of GPT-4 or the full PaLM could never hope to run on laptops and smartphones due to the memory wall, even with 5 more years of hardware advancement, there is a thriving ecosystem of model development geared towards on-device inference.
Today we are going to discuss these smaller models on client-side devices such as
laptops and phones. This discussion will focus on the gating factors for inference
performance, the fundamental limits to model sizes, and how hardware development
going forward will establish the boundaries of development here.
A prime example is that of Siri, Alexa, etc., being quite horrible as personal assistants. Large language models paired with natural voice synthesis could unlock far more human and intelligent AI assistants that help manage your life. From creating calendar events to summarizing conversations to search, there will be a personal assistant on every device based on a multi-modal language model.
These models are already far more capable than Siri, Google Assistant, Alexa, Bixby, etc., and we are still in the very early innings.
In some ways, generative AI is rapidly becoming a bimodal distribution: massive foundational models and much smaller models that can run on client devices garner the majority of investment, with a great chasm in between.
The highest-end client mobile devices will come with ~50 billion transistors and
more than enough TFLOP/s for on-device AI due to architectural innovations that
are in the pipeline at firms like Intel, AMD, Apple, Google, Samsung, Qualcomm,
and MediaTek. To be clear, none of their existing client AI accelerators are well-
suited for transformers, but that will change in a few years. These advancements in
the digital logic side of chips will solve the computing problem, but they cannot
tackle the true underlying problem of the memory wall and data reuse.
GPT-style models are trained to predict the next token (~= word) given the previous tokens. To generate text, you feed the model the prompt, ask it to predict the next token, append that generated token to the prompt, ask it to predict the next token again, and keep going. In order to do this, you have to send all the parameters from RAM to the processor every time you predict the next token. The first problem is that you have to store all these parameters as close as possible to the compute. The other problem is that you have to be able to load these parameters from memory into the compute exactly when you need them.
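As a rough illustration (not any particular library's API; `model`, `tokenizer`, and `forward` are placeholder names), the decode loop looks like this, and every pass through it touches essentially all of the model's weights:

```python
# Illustrative sketch of autoregressive decoding. `model` and `tokenizer`
# are placeholders, not a specific library's API.
def generate(model, tokenizer, prompt, max_new_tokens=128):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # Each forward pass reads essentially every parameter from memory,
        # so the full set of weights is streamed once per generated token.
        logits = model.forward(tokens)
        next_token = int(logits[-1].argmax())  # greedy: take the most likely token
        tokens.append(next_token)              # feed it back in and repeat
    return tokenizer.decode(tokens)
```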
A 7 billion parameter model at FP16 needs 14GB just to hold its weights. Furthermore, this 14GB ignores other applications, the operating system, and other overhead related to activations/KV cache. This puts an immediate limit on the size of the model that a developer can use to deploy on-device AI, even if they can assume the client endpoint has the computational oomph required. Storing 14GB of
parameters on a client-side processor is physically impossible. The most common type of on-chip memory is SRAM, which even on TSMC 3nm is only ~0.6GB per 100mm^2.
For reference, that is about the same size chip as the upcoming iPhone 15 Pro’s A17
and ~25% smaller than the upcoming M3. Furthermore, this figure is without
overhead from assist circuitry, array inefficiency, NoCs, etc. Large amounts of local
SRAM will not work for client inference. Emerging memories such as FeRAM and
MRAM do bring some hope for a light at the end of the tunnel, but they are quite far
away from being productized on the scale of Gigabytes.
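As a back-of-the-envelope check on why on-chip SRAM is a non-starter, using the ~0.6GB per 100mm^2 figure above (a sketch; the reticle-limit figure is approximate):

```python
# Rough area needed to hold a 7B-parameter FP16 model entirely in 3nm SRAM.
params = 7e9                  # 7 billion parameters
bytes_per_param = 2           # FP16
weights_gb = params * bytes_per_param / 1e9           # = 14 GB of weights
sram_gb_per_100mm2 = 0.6      # ~0.6 GB of SRAM per 100 mm^2 on TSMC 3nm
area_mm2 = weights_gb / sram_gb_per_100mm2 * 100      # ≈ 2,333 mm^2

print(f"{weights_gb:.0f} GB of weights -> ~{area_mm2:,.0f} mm^2 of SRAM")
# For context, a reticle-limited die is roughly 850 mm^2, and the A17/M3-class
# dies mentioned above are on the order of 100-140 mm^2.
```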
The next layer down the hierarchy is DRAM. The highest-end iPhone, the 14 Pro
Max, has 6GB of RAM, but the modal iPhone has 3GB of RAM. While high-end PCs will have 16GB+, the majority of new sales are at 8GB of RAM. The typical client device cannot even run a 7 billion parameter model at FP16!
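A quick sketch of how many parameters fit in a given RAM budget, before subtracting the OS, other apps, and the activations/KV cache:

```python
# Upper bound on model size that fits in a given RAM budget, ignoring the OS,
# other applications, and the activations/KV cache (all of which reduce it further).
def max_params_billions(ram_gb, bits_per_param):
    return ram_gb * 1e9 / (bits_per_param / 8) / 1e9

for ram_gb in (3, 6, 8, 16):            # modal iPhone, 14 Pro Max, typical PC, high-end PC
    for bits in (16, 8, 4):             # FP16, INT8, 4-bit quantization
        print(f"{ram_gb:>2} GB RAM @ {bits:>2}-bit: "
              f"~{max_params_billions(ram_gb, bits):.1f}B parameters")
```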
This then brings up the question: why can’t we go down another layer in the hierarchy? Can we run these models off of NAND-based SSDs instead of out of RAM?
Unfortunately, this is far too slow. A 7 billion parameter model at FP16 requires
14GB/s of IO just to stream the weights in to produce 1 token (~4 characters)! The
fastest PC storage drives are at best 6GB/s, but most phones and PCs come in south
of 1GB/s. At 1GB/s, at 4-bit quantization, the biggest model that can be run would
still only be in the range of ~2 billion parameters, and that’s with pegging the SSD at
max for only 1 application with no regard for any other use-case.
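The same arithmetic works for bandwidth: at batch size 1, the storage or memory bandwidth divided by the size of the weights is a hard ceiling on tokens per second, before any activation, KV-cache, or other traffic. A hedged sketch:

```python
# Ceiling on tokens/second when streaming weights at batch size 1:
# every token requires reading every weight once, so
#   tokens/s <= bandwidth / weight_bytes
def max_tokens_per_second(params_billions, bits_per_param, bandwidth_gb_s):
    weight_gb = params_billions * bits_per_param / 8
    return bandwidth_gb_s / weight_gb

print(max_tokens_per_second(7, 16, 6))   # 7B FP16 on a 6 GB/s SSD: ~0.43 tokens/s
print(max_tokens_per_second(2, 4, 1))    # 2B 4-bit on a 1 GB/s drive: ~1 token/s
```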
This is a fundamental limit on how large and powerful local AI can get. Perhaps a
company like Apple can use that to upsell newer, more expensive phones with more
advanced AI, but that is still a while away. With the same assumptions as above, on
PC, Intel’s top-of-the-line 13th generation i9 CPUs and Apple’s M2 are capped out at
about ~3 to ~4 billion parameters.
In general, these are just best-case figures for consumer devices. To repeat, for simplicity’s sake we ignored multiple factors, including the fact that theoretical IO speed is never achieved in practice, as well as the activations/KV cache. Those only push bandwidth requirements up further and constrain model size down even more. We will talk more below about innovative hardware platforms coming next year that can help reshape the landscape, but the memory wall limits the majority of current and future
devices.
Our data shows that HBM memory is nearly half the manufacturing costs of a server-
class AI chip like the H100 or TPUv5. While client-side compute does get to use
significantly cheaper DDR and LPDDR memory (~4x per GB), that memory cost
cannot be amortized over multiple concurrent inferences. Batch size cannot be pumped to infinity because that introduces another difficult problem: every request in the batch has to wait for every other request’s token to be processed before its own result can be appended and generation of the next token can begin.
This is solved by splitting the model across many chips. The above chart is the
latency for generating 20 tokens. Conveniently, the PaLM model hits the 6.67 tokens
per second, or a ~200 words per minute minimum viable target, with 64 chips
running inference at a batch size of 256. This means each time a parameter is loaded, it is utilized for 256 different inferences.
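A small sketch of that amortization (assuming 16-bit weights for illustration and ignoring KV-cache traffic): the whole model is read once per decode step, and that read is shared by every sequence in the batch.

```python
# Weight traffic charged to each sequence per generated token.
# The model is read from memory once per decode step and shared across the batch;
# KV-cache traffic is ignored and 16-bit weights are assumed for illustration.
def weight_gb_per_token_per_sequence(params_billions, bits_per_param, batch_size):
    weight_gb = params_billions * bits_per_param / 8
    return weight_gb / batch_size

for batch in (1, 8, 64, 256):
    gb = weight_gb_per_token_per_sequence(540, 16, batch)   # full PaLM is ~540B params
    print(f"batch {batch:>3}: ~{gb:,.1f} GB of weight reads per token per sequence")
```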
FLOPS utilization improves as batch size increases because the memory wall is being
mitigated. Latency can only be brought to a reasonable point by splitting the work
across more chips. With that said, even then, only 40% of FLOPS are even used.
Google demonstrated 76% FLOPS utilization with a latency of 85.2 seconds for PaLM
inference, so the memory wall is still clearly a massive factor.
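A roofline-style sketch makes that trend concrete: a decode step takes whichever is longer, streaming the weights or doing the math, and only the math term grows with batch size. The chip numbers below are purely illustrative placeholders, and KV-cache and interconnect costs are ignored.

```python
# Roofline-style estimate of FLOPS utilization for a single decode step.
# t_step = max(time to stream weights, time to do the math);
# utilization = compute time / t_step.
def utilization(weight_bytes, batch_size, peak_flops, mem_bw_bytes_s):
    params = weight_bytes / 2                      # assuming 16-bit weights
    flops = 2 * params * batch_size                # ~2 FLOPs per weight per sequence
    t_mem = weight_bytes / mem_bw_bytes_s          # stream all weights once per step
    t_compute = flops / peak_flops
    return t_compute / max(t_mem, t_compute)

# Purely illustrative accelerator: 300 TFLOP/s of 16-bit compute, 1 TB/s of memory bandwidth.
for batch in (1, 16, 64, 256):
    u = utilization(14e9, batch, 300e12, 1e12)     # 7B-parameter FP16 model
    print(f"batch {batch:>3}: ~{u * 100:.0f}% of peak FLOPS")
```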
So server-side inference is far more efficient, but how far can local models scale?
A guest post by
Sophia Wisdom
GPU kernel engineer at magic.dev