Abstract—Recently, OpenAI released the ChatGPT o1-preview model, which has reasoning abilities comparable to ranking within the top 2,000 in the U.S. Math Olympiad, significantly surpassing the average human linguistic ability. This Transformer-based model has become the standard not only in language processing but also in various fields such as vision and speech. As research in these fields progresses, a comprehensive understanding of GPT has become increasingly necessary. Through a detailed analysis and mathematical understanding of the GPT-3 Transformer Decoder architecture, we explore why it was designed this way and the effects of this design. Additionally, we trace the lineage of each component, examining the previous research from which they emerged, and propose methods to deepen the understanding of GPT. We also examine the rationale behind the KV caching methodology and how it operates.

Index Terms—Transformer, Decoder, KV Cache, LLM

This research was conducted with the support of the Hyundai Motor Chung Mong-Koo Foundation as part of a scholarship program.

I. INTRODUCTION

Recently, OpenAI released the o1-preview model, which demonstrated powerful reasoning capabilities. This model performed particularly well in previously underperforming areas such as mathematics and coding. For example, the previous model, GPT-4o, correctly solved only 13% of the problems in the International Mathematical Olympiad qualifier, whereas the o1-preview model achieved a score of 83% [1].

This showcases how GPT models, which initially excelled as simple language models, are now demonstrating strong reasoning abilities, extending their applications beyond just language to visual and auditory artificial intelligence. In the domain of visual AI, for instance, the Vision Transformer (ViT) [2] achieved state-of-the-art performance in image classification, surpassing traditional Convolutional Neural Networks (CNNs). Similarly, in the field of Optical Character Recognition (OCR), Nougat [3], which also utilizes a transformer-based architecture, has reached SOTA, leading to a reduced emphasis on architectural innovation within academia. In the auditory domain, OpenAI's Whisper [4] and NVIDIA's Canary [5] both employ transformer architectures to convert speech into text, with Whisper currently holding the SOTA position.

Deep learning models, particularly language models, primarily use the transformer decoder architecture, and this trend is likely to continue. This is largely due to the development of hardware specialized for model execution, particularly in addressing the KV-cache issue during inference. Since the hardware industry does not rapidly change its products, it is expected that the software architectures will also persist for the time being.

As such, students interested in language models need to develop a detailed understanding of the transformer decoder architecture. It is essential to understand the role of each layer and the reasoning behind their inclusion. Furthermore, it is important to recognize that this architecture did not emerge suddenly, but is part of a continuous evolution of research. Mastery of the transformer architecture will be crucial for effectively utilizing these models.

II. ANALYSIS

First, we will explain the preprocessing of input data, followed by a look at the lineage of each component of the model. Then, we will delve into an interpretation of detailed computations.

A. Converting Text into Numbers

In natural language processing models, inputting raw natural language directly into the model is not feasible, as computers cannot process it. Therefore, it is essential to convert the input text into numbers so that the computer can perform computations. This process, the transformation of text into computable vectors, is called "embedding." All existing natural language processing models include this embedding process to convert text into numbers.

In GPT-3, sentences are first tokenized using subword-based [6] byte pair encoding (BPE) [7]. The subword methodology merges frequently co-occurring characters to form tokens, which become the basic units for representing all languages.
Fig. 1. An illustration depicting the model architecture of GPT-3, with the number of heads set to 12.
In BPE, each token is mapped to bytes, including various languages and special characters. These tokens are then mapped to predetermined integers. Next, a vector corresponding to each token is created based on these integers, called "token embedding." This vector is pre-trained and captures semantic information between tokens during the learning process. In GPT-3, this information is contained in the WTE (Word Token Embedding) layer.

The generated token embedding vector is then added to the positional embedding vector, which encodes positional information of the tokens [8]. In GPT-3, this positional information is stored in the WPE (Word Position Embedding) layer. Since both vectors are of the same size, they can be summed, and the resulting vector becomes the input to the model's forward pass.
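To make this preprocessing concrete, the following is a minimal sketch in PyTorch. It assumes the `tiktoken` GPT-2 BPE tokenizer as a stand-in for GPT-3's tokenizer and uses freshly initialized (untrained) WTE/WPE tables with illustrative sizes, not GPT-3's actual dimensions.

```python
import tiktoken                               # assumed available; GPT-2 BPE as a stand-in
import torch
import torch.nn as nn

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("I am a student")         # BPE: text -> integer token ids
idx = torch.tensor(tokens).unsqueeze(0)       # shape (B=1, T)

n_vocab, n_ctx, n_embd = enc.n_vocab, 1024, 768   # illustrative sizes
wte = nn.Embedding(n_vocab, n_embd)           # token embedding table (WTE)
wpe = nn.Embedding(n_ctx, n_embd)             # position embedding table (WPE)

pos = torch.arange(idx.size(1)).unsqueeze(0)  # positions 0 .. T-1
x = wte(idx) + wpe(pos)                       # summed input to the model's forward pass
print(x.shape)                                # e.g. torch.Size([1, 4, 768])
```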
B. The Hierarchy and Lineage of the Transformer Class

The objective of the Transformer is to predict sequential data, which can take various forms such as language, speech, or images. Models that handle sequential data started with Recurrent Neural Networks (RNNs) [9]. To address the gradient vanishing problem of RNNs, LSTMs [10] were introduced, though the method had been proposed as early as 1997, before RNNs became popular. Subsequently, the seq2seq [11], [12] model, which used LSTMs in a multilayer setup, was proposed, introducing the Encoder-Decoder structure. This structure abstracted the input into a fixed-length vector using an encoder and then transformed it back into natural language through a decoder.

Next, the revolutionary Attention mechanism was proposed, followed by Self-Attention, which, when combined with previous seq2seq models, resulted in the initial Transformer model.

Figure 1 shows the Block layer of GPT-3. A Block is defined as a class. The number of Blocks stacked (12, 16, or more) determines the model's size (small, medium, large). For simplicity, let us examine just one Block. Each Block consists of two main elements: the Causal Self-Attention (CSA) class and the MLP class. The CSA class can be considered the feature extraction component, while the MLP class acts as the classifier. This structure is quite traditional, dating back to early CNN models like LeNet [13]. In CNNs, feature extraction was done through convolution, whereas in Transformers, it is achieved via self-attention.

The arrows following the input in Figure 1 show that some inputs bypass the layers and are added back to the outputs after passing through the layers. This method was inspired by ResNet [14] in CNNs, where an identity mapping technique called the Residual connection was introduced. This technique has been directly applied to the Transformer. Thus, much of the seemingly complex GPT structure can be traced back to previous methodologies.
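The structure of a Block and its two residual connections can be written compactly in PyTorch. The sketch below is not GPT-3's actual source: it assumes the pre-LayerNorm ordering used in common GPT-2-style reference implementations and uses `nn.MultiheadAttention` as a stand-in for the CausalSelfAttention class, with illustrative sizes.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One decoder block: LayerNorm, causal attention, MLP, and two residual adds."""
    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                      # the MLP "classifier" part
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):                              # x: (B, T, C)
        h = self.ln_1(x)
        T = x.size(1)
        causal = torch.triu(                           # True above the diagonal = masked
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a                                      # residual connection around attention
        x = x + self.mlp(self.ln_2(x))                 # residual connection around the MLP
        return x

x = torch.randn(1, 4, 768)
print(Block()(x).shape)                                # torch.Size([1, 4, 768])
```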
C. Layer Normalization

As shown in Figure 1, Transformers apply Layer Normalization (LN) [15]. Generally, in models like CNNs, Batch Normalization (BN) is used, which standardizes values within the same channel across multiple batches. In contrast, LN performs normalization within a single sample. Specifically, LN standardizes the vector values of each token within a sentence in the batch. If BN were applied, it would require normalizing the tokens at the same position across multiple sentences, which would cause issues. Unlike CNNs, Transformers process sequential data, where batches are input sequentially. Using BN would require storing separate mean and variance values, which increases memory usage. Another issue with BN is that it does not work well when the batch size is small, as there would be no variance in a batch of one.

Mathematically, LN can be expressed as:

$$\mathrm{LN}(x) = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here, x is the input, µ is the mean, and σ² is the variance. The constant ϵ prevents the denominator from being zero. The parameters γ and β are learned during training. After a residual connection is applied, LN is always performed to standardize the output, as can be seen in Figure 1.
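As a quick check of this formula, the following minimal sketch computes LN by hand over the channel dimension of each token vector and compares it with `torch.nn.LayerNorm` (whose γ and β are 1 and 0 at initialization); the tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4, 768)                      # (batch, tokens, channels)

# Manual LayerNorm over the last (channel) dimension of each token vector.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
eps = 1e-5
manual = (x - mu) / torch.sqrt(var + eps)       # gamma = 1, beta = 0

ln = nn.LayerNorm(768, eps=eps)                 # gamma and beta are learned during training
print(torch.allclose(manual, ln(x), atol=1e-5)) # True
```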
D. Understanding the Linear Layer as Matrix Multiplication

Transformers contain numerous Linear Layers, also known as Fully Connected Layers. When several Linear Layers are stacked, they are referred to as a Multi-Layer Perceptron (MLP). The operations in a Linear Layer are simply matrix multiplication and addition. If the input activation is x, the weight matrix W and the bias vector b are applied to produce the output as:

$$Wx + b$$
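A minimal sketch of this equivalence, with illustrative layer sizes (PyTorch stores the weight as a matrix of shape out × in, so the explicit matrix product is written with a transpose):

```python
import torch
import torch.nn as nn

lin = nn.Linear(768, 3072)                     # e.g. the MLP expansion in a GPT block
x = torch.randn(2, 5, 768)                     # (batch, tokens, channels)

out = lin(x)                                   # the Linear Layer
manual = x @ lin.weight.T + lin.bias           # the same matrix multiplication and addition
print(torch.allclose(out, manual, atol=1e-6))  # True
```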
[Fig. 2: the output of the self.c_attn Linear Layer is split into Q, K, and V and reshaped per head to shape B × T × #head × (C/#head), with #head = 12.]

The attention computation inside each Block can be accelerated with the Flash Attention algorithm [16], [17]. Previously, multiple kernels were required, but with Flash Attention, the attention process can be executed with a single kernel, reducing training time by up to 40%. When computations are performed using multiple kernels, there is frequent data movement between the GPU's memory (HBM) and the GPU itself, which consumes significant time for data loading. However, when executed with a single kernel, the time spent on data loading and writing is drastically reduced. This is the key idea behind Flash Attention, resulting in a more efficient computation process.
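The QKV split and the masked attention step can be written in a few lines. The sketch below, with illustrative sizes, performs the split from a single c_attn-style projection and then calls `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to fused Flash-Attention-style kernels; whether such a kernel is actually used depends on the PyTorch build and the GPU, which is an assumption here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C, n_head = 1, 4, 768, 12
x = torch.randn(B, T, C)

c_attn = nn.Linear(C, 3 * C)                      # one projection producing Q, K, V
q, k, v = c_attn(x).split(C, dim=2)               # QKV split

# Head split: (B, T, C) -> (B, n_head, T, C // n_head)
def heads(t):
    return t.view(B, T, n_head, C // n_head).transpose(1, 2)

q, k, v = heads(q), heads(k), heads(v)

# Fused causal attention; may run a Flash-Attention-style kernel when available.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

y = y.transpose(1, 2).contiguous().view(B, T, C)  # merge the heads back
print(y.shape)                                    # torch.Size([1, 4, 768])
```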
Fig. 3. An illustration of the advantages of the KV cache. Without the KV cache, inference in an autoregressive language model suffers from the issue of
increasing input length, as the model must reprocess all previous tokens. By using the KV cache, only the most recent token needs to be input, while the
Keys and Values of previous tokens are retrieved from the cache, enabling the model to compute the desired attention efficiently.
The KV cache eliminates the redundant computation associated with sequential attention calculations. This improvement plays a crucial role in making large language models practically usable for real-time applications, where low latency is essential.
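To illustrate the mechanism described in the Fig. 3 caption, here is a minimal single-head sketch with illustrative sizes: the Keys and Values of previous tokens are kept in a cache, so each decoding step only projects the newest token.

```python
import torch
import torch.nn.functional as F

d = 64                                   # head dimension (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []                # Keys and Values of all previous tokens

def decode_step(x_new):
    """x_new: (1, d) embedding of only the most recent token."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)           # cache K and V instead of recomputing them
    v_cache.append(x_new @ Wv)
    K = torch.cat(k_cache, dim=0)        # (t, d): previous Keys come from the cache
    V = torch.cat(v_cache, dim=0)
    att = F.softmax(q @ K.T / d ** 0.5, dim=-1)
    return att @ V                       # (1, d) attention output for the new token

for _ in range(3):                       # autoregressive decoding, one token at a time
    out = decode_step(torch.randn(1, d))
print(out.shape)                         # torch.Size([1, 64])
```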
IV. CONCLUSION

In this paper, we conducted a detailed analysis of the Transformer Decoder architecture of GPT-3, discussing the reasons for its design and its mathematical underpinnings. We demonstrated how GPT-3's Block structure is situated within the lineage of Transformers, which evolved from earlier models such as RNN, LSTM, and Seq2Seq. Additionally, we explained the roles of key components introduced to enhance the performance of GPT models, such as Causal Self-Attention, MLP, and Layer Normalization. We also explained why the KV-cache method should be used. This research reaffirms the importance of the Transformer architecture not only in natural language processing (NLP) but also in various AI fields such as vision and audio.

ACKNOWLEDGMENT

This research project was sponsored by the Next Generation Semiconductor Convergence and Open Sharing System.

REFERENCES

[1] T. Zhong, Z. L., et al., "Evaluation of OpenAI o1: Opportunities and challenges of AGI," 2024.
[2] A. Dosovitskiy, "An image is worth 16x16 words: Transformers for image recognition at scale," International Conference on Learning Representations (ICLR), 2021.
[3] L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic, "Nougat: Neural optical understanding for academic documents," 2023.
[4] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," 2022.
[5] K. C. Puvvada, P. Żelasko, H. Huang, O. Hrinchuk, N. R. Koluguri, K. Dhawan, S. Majumdar, E. Rastorgueva, Z. Chen, V. Lavrukhin, et al., "Less is more: Accurate speech recognition & translation without web-scale data," Interspeech 2024, 2024.
[6] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (E. Blanco and W. Lu, eds.), Brussels, Belgium, pp. 66–71, Association for Computational Linguistics, Nov. 2018.
[7] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (K. Erk and N. A. Smith, eds.), Berlin, Germany, pp. 1715–1725, Association for Computational Linguistics, Aug. 2016.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
[9] A. Sherstinsky, "Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network," Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020.
[10] A. Graves, "Long short-term memory," in Supervised Sequence Labelling with Recurrent Neural Networks, pp. 37–45, 2012.
[11] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, Cambridge, MA, USA, pp. 3104–3112, MIT Press, 2014.
[12] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (A. Moschitti, B. Pang, and W. Daelemans, eds.), Doha, Qatar, pp. 1724–1734, Association for Computational Linguistics, Oct. 2014.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
[15] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016.
[16] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," 2022.
[17] T. Dao, "FlashAttention-2: Faster attention with better parallelism and work partitioning," 2023.