Abstract—Recently, OpenAI released the ChatGPT o1-preview model, which has reasoning abilities comparable to ranking within the top 2,000 in the U.S. Math Olympiad, significantly surpassing the average human linguistic ability. This Transformer-based model has become the standard not only in language processing but also in various fields such as vision and speech. As research in these fields progresses, a comprehensive understanding of GPT has become increasingly necessary. Through a detailed analysis and mathematical understanding of the GPT-3 Transformer Decoder architecture, we explore why it was designed this way and the effects of this design. Additionally, we trace the lineage of each component, examining the previous research from which they emerged, and propose methods to deepen the understanding of GPT. We also examine the rationale behind the KV caching methodology and how it operates.

Index Terms—Transformer, Decoder, KV Cache, LLM

This research was conducted with the support of the Hyundai Motor Chung Mong-Koo Foundation as part of a scholarship program.

I. INTRODUCTION

Recently, OpenAI released the o1-preview model, which demonstrated powerful reasoning capabilities. This model performed particularly well in previously underperforming areas such as mathematics and coding. For example, the previous model, GPT-4o, correctly solved only 13% of the problems in the International Mathematical Olympiad qualifier, whereas the o1-preview model achieved a score of 83% [1].

This showcases how GPT models, which initially excelled as simple language models, are now demonstrating strong reasoning abilities, extending their applications beyond just language to visual and auditory artificial intelligence. In the domain of visual AI, for instance, the Vision Transformer (ViT) [2] achieved state-of-the-art performance in image classification, surpassing traditional Convolutional Neural Networks (CNNs). Similarly, in the field of Optical Character Recognition (OCR), Nougat [3], which also utilizes a transformer-based architecture, has reached SOTA, leading to a reduced emphasis on architectural innovation within academia. In the auditory domain, OpenAI's Whisper [4] and NVIDIA's Canary [5] both employ transformer architectures to convert speech into text, with Whisper currently holding the SOTA position.

Deep learning models, particularly language models, primarily use the transformer decoder architecture, and this trend is likely to continue. This is largely due to the development of hardware specialized for model execution, particularly in addressing the KV-cache issue during inference. Since the hardware industry does not rapidly change its products, it is expected that the software architectures will also persist for the time being.

As such, students interested in language models need to develop a detailed understanding of the transformer decoder architecture. It is essential to understand the role of each layer and the reasoning behind their inclusion. Furthermore, it is important to recognize that this architecture did not emerge suddenly, but is part of a continuous evolution of research. Mastery of the transformer architecture will be crucial for effectively utilizing these models.

II. ANALYSIS

First, we will explain the preprocessing of input data, followed by a look at the lineage of each component of the model. Then, we will delve into an interpretation of detailed computations.

A. Converting Text into Numbers

In natural language processing models, inputting raw natural language directly into the model is not feasible, as computers cannot process it. Therefore, it is essential to convert the input text into numbers so that the computer can perform computations. This process, the transformation of text into computable vectors, is called "embedding." All existing natural language processing models include this embedding process to convert text into numbers.

In GPT-3, sentences are first tokenized using subword-based [6] byte pair encoding (BPE) [7]. The subword methodology merges frequently co-occurring characters to form tokens, which become the basic units for representing all languages.
Fig. 1. An illustration depicting the model architecture of GPT-3, with the number of heads set to 12.
In BPE, each token is mapped to bytes, including various languages and special characters. These tokens are then mapped to predetermined integers. Next, a vector corresponding to each token is created based on these integers, called "token embedding." This vector is pre-trained and captures semantic information between tokens during the learning process. In GPT-3, this information is contained in the WTE (Word Token Embedding) layer.

The generated token embedding vector is then added to the positional embedding vector, which encodes positional information of the tokens [8]. In GPT-3, this positional information is stored in the WPE (Word Position Embedding) layer. Since both vectors are of the same size, they can be summed, and the resulting vector becomes the input to the model's forward pass.
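To make this preprocessing concrete, the following is a minimal sketch in PyTorch. It assumes the `tiktoken` GPT-2 BPE tokenizer as a stand-in for GPT-3's tokenizer and uses freshly initialized (untrained) WTE/WPE tables with illustrative sizes, not GPT-3's actual dimensions.

```python
import tiktoken                               # assumed available; GPT-2 BPE as a stand-in
import torch
import torch.nn as nn

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("I am a student")         # BPE: text -> integer token ids
idx = torch.tensor(tokens).unsqueeze(0)       # shape (B=1, T)

n_vocab, n_ctx, n_embd = enc.n_vocab, 1024, 768   # illustrative sizes
wte = nn.Embedding(n_vocab, n_embd)           # token embedding table (WTE)
wpe = nn.Embedding(n_ctx, n_embd)             # position embedding table (WPE)

pos = torch.arange(idx.size(1)).unsqueeze(0)  # positions 0 .. T-1
x = wte(idx) + wpe(pos)                       # summed input to the model's forward pass
print(x.shape)                                # e.g. torch.Size([1, 4, 768])
```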
B. The Hierarchy and Lineage of the Transformer Class

The objective of the Transformer is to predict sequential data, which can take various forms such as language, speech, or images. Models that handle sequential data started with Recurrent Neural Networks (RNNs) [9]. To address the gradient vanishing problem of RNNs, LSTMs [10] were introduced, though the method had been proposed as early as 1997, before RNNs became popular. Subsequently, the seq2seq [11], [12] model, which used LSTMs in a multilayer setup, was proposed, introducing the Encoder-Decoder structure. This structure abstracted the input into a fixed-length vector using an encoder and then transformed it back into natural language through a decoder.

Next, the revolutionary Attention mechanism was proposed, followed by Self-Attention, which, when combined with previous seq2seq models, resulted in the initial Transformer model.

Figure 1 shows the Block layer of GPT-3. A Block is defined as a class. The number of Blocks stacked (12, 16, or more) determines the model's size (small, medium, large). For simplicity, let us examine just one Block. Each Block consists of two main elements: the Causal Self-Attention (CSA) class and the MLP class. The CSA class can be considered the feature extraction component, while the MLP class acts as the classifier. This structure is quite traditional, dating back to early CNN models like LeNet [13]. In CNNs, feature extraction was done through convolution, whereas in Transformers, it is achieved via self-attention.

The arrows following the input in Figure 1 show that some inputs bypass the layers and are added back to the outputs after passing through the layers. This method was inspired by ResNet [14] in CNNs, where an identity mapping technique called the Residual connection was introduced. This technique has been directly applied to the Transformer. Thus, much of the seemingly complex GPT structure can be traced back to previous methodologies.
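The structure of a Block and its two residual connections can be written compactly in PyTorch. The sketch below is not GPT-3's actual source: it assumes the pre-LayerNorm ordering used in common GPT-2-style reference implementations and uses `nn.MultiheadAttention` as a stand-in for the CausalSelfAttention class, with illustrative sizes.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One decoder block: LayerNorm, causal attention, MLP, and two residual adds."""
    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                      # the MLP "classifier" part
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):                              # x: (B, T, C)
        h = self.ln_1(x)
        T = x.size(1)
        causal = torch.triu(                           # True above the diagonal = masked
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a                                      # residual connection around attention
        x = x + self.mlp(self.ln_2(x))                 # residual connection around the MLP
        return x

x = torch.randn(1, 4, 768)
print(Block()(x).shape)                                # torch.Size([1, 4, 768])
```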
C. Layer Normalization

As shown in Figure 1, Transformers apply Layer Normalization (LN) [15]. Generally, in models like CNNs, Batch Normalization (BN) is used, which standardizes values within the same channel across multiple batches. In contrast, LN performs normalization within a single sample. Specifically, LN standardizes the vector values of each token within a sentence in the batch. If BN were applied, it would require normalizing the tokens at the same position across multiple sentences, which would cause issues. Unlike CNNs, Transformers process sequential data, where batches are input sequentially. Using BN would require storing separate mean and variance values, which increases memory usage. Another issue with BN is that it does not work well when the batch size is small, as there would be no variance in a batch of one.

Mathematically, LN can be expressed as:

$$\mathrm{LN}(x) = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here, x is the input, µ is the mean, and σ² is the variance. The constant ϵ prevents the denominator from being zero. The parameters γ and β are learned during training. After a residual connection is applied, LN is always performed to standardize the output, as can be seen in Figure 1.
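As a quick check of this formula, the following minimal sketch computes LN by hand over the channel dimension of each token vector and compares it with `torch.nn.LayerNorm` (whose γ and β are 1 and 0 at initialization); the tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4, 768)                      # (batch, tokens, channels)

# Manual LayerNorm over the last (channel) dimension of each token vector.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
eps = 1e-5
manual = (x - mu) / torch.sqrt(var + eps)       # gamma = 1, beta = 0

ln = nn.LayerNorm(768, eps=eps)                 # gamma and beta are learned during training
print(torch.allclose(manual, ln(x), atol=1e-5)) # True
```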
D. Understanding the Linear Layer as Matrix Multiplication

Transformers contain numerous Linear Layers, also known as Fully Connected Layers. When several Linear Layers are stacked, they are referred to as a Multi-Layer Perceptron (MLP). The operations in a Linear Layer are simply matrix multiplication and addition. If the input activation is x, the weight matrix W and the bias vector b are applied to produce the output as:

$$Wx + b$$
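A minimal sketch of this equivalence, with illustrative layer sizes (PyTorch stores the weight as a matrix of shape out × in, so the explicit matrix product is written with a transpose):

```python
import torch
import torch.nn as nn

lin = nn.Linear(768, 3072)                     # e.g. the MLP expansion in a GPT block
x = torch.randn(2, 5, 768)                     # (batch, tokens, channels)

out = lin(x)                                   # the Linear Layer
manual = x @ lin.weight.T + lin.bias           # the same matrix multiplication and addition
print(torch.allclose(out, manual, atol=1e-6))  # True
```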
[Fig. 2: the output of the self.c_attn Linear Layer is split into Q, K, and V and reshaped per head to shape B × T × #head × (C/#head), with #head = 12.]

The attention computation inside each Block can be accelerated with the Flash Attention algorithm [16], [17]. Previously, multiple kernels were required, but with Flash Attention, the attention process can be executed with a single kernel, reducing training time by up to 40%. When computations are performed using multiple kernels, there is frequent data movement between the GPU's memory (HBM) and the GPU itself, which consumes significant time for data loading. However, when executed with a single kernel, the time spent on data loading and writing is drastically reduced. This is the key idea behind Flash Attention, resulting in a more efficient computation process.
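The QKV split and the masked attention step can be written in a few lines. The sketch below, with illustrative sizes, performs the split from a single c_attn-style projection and then calls `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to fused Flash-Attention-style kernels; whether such a kernel is actually used depends on the PyTorch build and the GPU, which is an assumption here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C, n_head = 1, 4, 768, 12
x = torch.randn(B, T, C)

c_attn = nn.Linear(C, 3 * C)                      # one projection producing Q, K, V
q, k, v = c_attn(x).split(C, dim=2)               # QKV split

# Head split: (B, T, C) -> (B, n_head, T, C // n_head)
def heads(t):
    return t.view(B, T, n_head, C // n_head).transpose(1, 2)

q, k, v = heads(q), heads(k), heads(v)

# Fused causal attention; may run a Flash-Attention-style kernel when available.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

y = y.transpose(1, 2).contiguous().view(B, T, C)  # merge the heads back
print(y.shape)                                    # torch.Size([1, 4, 768])
```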
Fig. 3. An illustration of the advantages of the KV cache. Without the KV cache, inference in an autoregressive language model suffers from the issue of
increasing input length, as the model must reprocess all previous tokens. By using the KV cache, only the most recent token needs to be input, while the
Keys and Values of previous tokens are retrieved from the cache, enabling the model to compute the desired attention efficiently.
The KV cache eliminates the redundant computation associated with sequential attention calculations. This improvement plays a crucial role in making large language models practically usable for real-time applications, where low latency is essential.
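To illustrate the mechanism described in the Fig. 3 caption, here is a minimal single-head sketch with illustrative sizes: the Keys and Values of previous tokens are kept in a cache, so each decoding step only projects the newest token.

```python
import torch
import torch.nn.functional as F

d = 64                                   # head dimension (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []                # Keys and Values of all previous tokens

def decode_step(x_new):
    """x_new: (1, d) embedding of only the most recent token."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)           # cache K and V instead of recomputing them
    v_cache.append(x_new @ Wv)
    K = torch.cat(k_cache, dim=0)        # (t, d): previous Keys come from the cache
    V = torch.cat(v_cache, dim=0)
    att = F.softmax(q @ K.T / d ** 0.5, dim=-1)
    return att @ V                       # (1, d) attention output for the new token

for _ in range(3):                       # autoregressive decoding, one token at a time
    out = decode_step(torch.randn(1, d))
print(out.shape)                         # torch.Size([1, 64])
```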
IV. CONCLUSION

In this paper, we conducted a detailed analysis of the Transformer Decoder architecture of GPT-3, discussing the reasons for its design and its mathematical underpinnings. We demonstrated how GPT-3's Block structure is situated within the lineage of Transformers, which evolved from earlier models such as RNN, LSTM, and Seq2Seq. Additionally, we explained the roles of key components introduced to enhance the performance of GPT models, such as Causal Self-Attention, MLP, and Layer Normalization. We also explained why the KV-cache method should be used. This research reaffirms the importance of the Transformer architecture not only in natural language processing (NLP) but also in various AI fields such as vision and audio.

ACKNOWLEDGMENT

This research project was sponsored by the Next Generation Semiconductor Convergence and Open Sharing System.

REFERENCES

[1] T. Zhong, Z. L., et al., "Evaluation of OpenAI o1: Opportunities and challenges of AGI," 2024.
[2] A. Dosovitskiy, "An image is worth 16x16 words: Transformers for image recognition at scale," International Conference on Learning Representations (ICLR), 2021.
[3] L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic, "Nougat: Neural optical understanding for academic documents," 2023.
[4] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," 2022.
[5] K. C. Puvvada, P. Żelasko, H. Huang, O. Hrinchuk, N. R. Koluguri, K. Dhawan, S. Majumdar, E. Rastorgueva, Z. Chen, V. Lavrukhin, et al., "Less is more: Accurate speech recognition & translation without web-scale data," Interspeech 2024, 2024.
[6] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (E. Blanco and W. Lu, eds.), Brussels, Belgium, pp. 66–71, Association for Computational Linguistics, Nov. 2018.
[7] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (K. Erk and N. A. Smith, eds.), Berlin, Germany, pp. 1715–1725, Association for Computational Linguistics, Aug. 2016.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
[9] A. Sherstinsky, "Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network," Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020.
[10] A. Graves, "Long short-term memory," in Supervised Sequence Labelling with Recurrent Neural Networks, pp. 37–45, 2012.
[11] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, Cambridge, MA, USA, pp. 3104–3112, MIT Press, 2014.
[12] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (A. Moschitti, B. Pang, and W. Daelemans, eds.), Doha, Qatar, pp. 1724–1734, Association for Computational Linguistics, Oct. 2014.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
[15] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016.
[16] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," 2022.
[17] T. Dao, "FlashAttention-2: Faster attention with better parallelism and work partitioning," 2023.