LLMBC 2023
LLM Foundations
Sergey Karayev
Agenda
00 Foundations of ML: speedrun of key ideas in ML
01 The Transformer Architecture: core ideas and notable examples
02 Notable LLMs: T5, GPT, Chinchilla, et al
03 Training & Inference: running a Transformer
00
Foundations of
Machine Learning
• Unsupervised learning: learn the structure of data in order to generate more data
• Supervised learning: learn how data maps to labels, in order to recognize or predict (e.g. "Hey Siri")
• Reinforcement learning: learn how to act in an environment to obtain reward
Converging on just...
• Image → "cat"
• Text: "This product does what it is supposed __" → "to."
• Game state → next move
• Audio → "Hey Siri"
What the machine "sees" [76, 105, 110, 99, 111, 108, 110]
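A quick Python check (assuming the integers above are ASCII codes, which they appear to be):

codes = [76, 105, 110, 99, 111, 108, 110]
print(bytes(codes).decode("ascii"))      # -> "Lincoln"
print(list("Lincoln".encode("ascii")))   # -> the same integers back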
How is it done?
- Logistic Regression
Inspiration
Output: say "cat"
https://medicalxpress.com/news/2018-07-neuron-axons-spindly-theyre-optimizing.html
Formalization
Output: say "cat"
https://www.the-scientist.com/the-nutshell/what-made-human-brains-so-big-36663
(Diagram: a neuron as a function mapping inputs to an output)
https://www.jessicayung.com/explaining-tensorflow-code-for-a-multilayer-perceptron/
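A minimal NumPy sketch of this formalization, with arbitrary layer sizes chosen only for illustration: a multilayer perceptron is just matrix multiplies separated by nonlinearities.

import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # Two-layer perceptron: inputs -> hidden layer (ReLU) -> output scores
    h = np.maximum(0, x @ W1 + b1)
    return h @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 784))                       # e.g. one flattened 28x28 image
W1, b1 = rng.normal(size=(784, 128)) * 0.01, np.zeros(128)
W2, b2 = rng.normal(size=(128, 10)) * 0.01, np.zeros(10)
print(mlp_forward(x, W1, b1, W2, b2).shape)         # (1, 10): one score per class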
Training
• Data X (e.g. images), labels y
https://developers.google.com/machine-learning/testing-debugging/metrics/interpretic
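The slide names the ingredients (data X, labels y); a toy sketch of the standard recipe that uses them, assuming a mean-squared-error loss and plain gradient descent (my choice of loss and optimizer, not specified on the slide):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # data: 100 examples, 3 features
y = X @ np.array([2.0, -1.0, 0.5])      # labels (generated from a known linear rule)

w = np.zeros(3)
for step in range(500):
    grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
    w -= 0.1 * grad                     # gradient descent update
print(w)                                # approaches [2.0, -1.0, 0.5]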
Dataset Splitting
• Split (X, y) into training (~80%), validation (~10%), and test (~10%) sets
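A minimal sketch of that split with NumPy (shuffling before splitting is assumed, as is standard):

import numpy as np

def split(X, y, seed=0):
    # Shuffle, then carve off ~80% train, ~10% validation, ~10% test
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
    train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])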
• Pre-training: slow training on a lot of data
• Fine-tuning: fast training on a little data
Model Hubs
- 180K models
- 30K datasets
http://lucasb.eyer.be/transformer
Transformer cartoon (DALL-E)
01
The Transformer
Architecture
Inputs
• Inputs need to be vectors of numbers
• Start with original text:
- "It's a blue sundress."
• Turn into a sequence of tokens:
- [<SOS>, It, 's, a, blue, sund, ress, ., <EOS>]
• Turn into vocabulary IDs:
- [0, 1026, 338, 257, 4171, 37437, 601, 13, 1]
• Each ID can be represented by a one-hot vector
- e.g. 3 -> [0, 0, 0, 1, 0, 0, 0, 0, ...]
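A sketch of the last two steps in NumPy; the IDs are copied from the slide, and a vocabulary size of 50,257 (GPT-2's) is assumed for illustration:

import numpy as np

token_ids = [0, 1026, 338, 257, 4171, 37437, 601, 13, 1]   # IDs from the slide
vocab_size = 50257                                          # GPT-2 vocabulary size (assumed)

one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
print(one_hot.shape)    # (9, 50257): one long, very sparse vector per token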
Input Embedding
• One-hot vectors are poor representations of words or tokens, so each token ID is instead mapped to a learned, dense embedding vector
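Multiplying a one-hot vector by an embedding matrix just selects one row, so in practice the ID indexes directly into the matrix. A sketch with illustrative dimensions:

import numpy as np

rng = np.random.default_rng(0)
d_model = 512                            # embedding dimension (illustrative)
E = rng.normal(size=(50257, d_model))    # learned embedding matrix, one row per vocabulary entry

token_ids = [1026, 338, 257]             # "It", "'s", "a" from the previous slide
embeddings = E[token_ids]                # one-hot @ E == row lookup
print(embeddings.shape)                  # (3, 512)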
Attention
• (Ignore "Masked Multi-Head" for now)
https://lilianweng.github.io/posts/2018-06-24-attention/
- Attention weights are made to sum to 1 (via softmax)
http://www.peterbloem.nl/blog/transformers
Basic self-attention
- Query
- Key
- Value
http://www.peterbloem.nl/blog/transformers
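A NumPy sketch of basic self-attention, where queries, keys, and values are all just the input vectors themselves:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def basic_self_attention(X):
    weights = softmax(X @ X.T)    # (seq, seq); each row is made to sum to 1
    return weights @ X            # each output is a weighted average of all inputs

X = np.random.default_rng(0).normal(size=(5, 16))    # 5 tokens, 16-dim vectors
print(basic_self_attention(X).shape)                 # (5, 16)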
Basic self-attention
• Problem: there's no learning involved!
http://lucasb.eyer.be/transformer
http://www.peterbloem.nl/blog/transformers
Multi-head attention
3-headed attention
http://www.peterbloem.nl/blog/transformers
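A sketch of multi-head attention with learned projection matrices (the fix for the "no learning" problem above); 3 heads to match the cartoon, all sizes illustrative:

import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # scaled dot-product
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # softmax over keys
    return weights @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=3):
    # Each head gets its own slice of the learned query/key/value projections
    d_head = Wq.shape[1] // n_heads
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(X @ Wq[:, s], X @ Wk[:, s], X @ Wv[:, s]))
    return np.concatenate(heads, axis=-1) @ Wo            # concatenate heads, project back

rng = np.random.default_rng(0)
d = 24
X = rng.normal(size=(5, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)      # (5, 24)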
Masking attention
• In training: the whole sequence is fed in at once, so attention to future tokens must be masked out
• At generation time: one new token comes in at a time
https://jalammar.github.io/illustrated-
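A sketch of the causal mask: adding -inf above the diagonal before the softmax removes all attention to future positions.

import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal: position i cannot attend to positions j > i
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.random.default_rng(0).normal(size=(4, 4))   # raw attention scores
masked = scores + causal_mask(4)   # masked entries become -inf, so softmax gives them weight 0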
Positional Encoding
• Attention is totally position-invariant!
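One common fix is to add a position-dependent vector to each token embedding; a sketch of the sinusoidal scheme from the original Transformer paper (learned position embeddings are the other popular choice):

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / 10000 ** (i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions
    pe[:, 1::2] = np.cos(angles)    # odd dimensions
    return pe

# Added (not concatenated) to the token embeddings before the first layer:
x = np.random.default_rng(0).normal(size=(9, 512)) + sinusoidal_positional_encoding(9, 512)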
Add
• "Skip connections" aka "residual blocks"
Layer Normalization
• Neural net modules perform best when input vectors have uniform mean and std in each dimension
https://arxiv.org/pdf/1803.08494.pdf
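A minimal LayerNorm sketch; gamma and beta are learned per-dimension scale and shift vectors:

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's vector to zero mean / unit std, then apply learned scale and shift
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

# Paired with the skip connection from the previous slide ("Add & Norm"):
#   x = layer_norm(x + sublayer(x), gamma, beta)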
• Conceptual view: (figure)
Transformer Architecture
https://aizi.substack.com/p/how-does-gpt-3-spend-its-175b-parameters
• The main Transformer Layer is stacked many times
- Number of layers
- Embedding dimension
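Those two numbers largely determine the parameter count. A rough back-of-the-envelope sketch (ignoring biases, layer norms, and positional embeddings, and assuming the standard 4x MLP expansion):

def rough_param_count(n_layers, d_model, vocab_size=50257):
    attention = 4 * d_model * d_model           # Wq, Wk, Wv, Wo
    mlp = 2 * d_model * (4 * d_model)           # up- and down-projection
    return n_layers * (attention + mlp) + vocab_size * d_model   # + token embeddings

print(f"{rough_param_count(96, 12288) / 1e9:.0f}B")   # GPT-3's published shape -> ~175B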
https://srush.github.io/raspy/
https://www.youtube.com/watch?v=kCc8FmEb1nY
Resources
Questions?
02
Notable LLMs
BERT (2019)
• Bidirectional Encoder Representations from Transformers
• 110M params
T5 (2019)
• Encoder-Decoder architecture
• 11B parameters
T5 Training Data
• Unsupervised pre-training on Colossal Clean Crawled Corpus (C4)
https://paperswithcode.com/dataset/c4
https://stanford-cs324.github.io/winter2022/lectures/data/
From Lukas Beyer’s lecture Intro to Transformers.
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
GPT-2 tokenization: byte-pair encoding, a middle ground between
- old-school NLP tokenization, where out-of-vocab words would be replaced by a special token
- UTF-8 bytes
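A quick illustration of why bytes avoid the out-of-vocab problem:

text = "Ünïcodé strings still work"
print(list(text.encode("utf-8")))   # any text maps to bytes 0-255, so nothing is ever out-of-vocabulary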
GPT-3 (2020)
• Just like GPT-2, but 100x larger (175B params)
https://arxiv.org/pdf/2005.14165.pdf
GPT-4 (2023)
https://ourworldindata.org/grapher/ai-training-computation
LLMs
Chinchilla (2022)
• Empirically derived formulas for optimal model size and training-set size given a fixed compute budget
https://arxiv.org/pdf/2203.15556.pdf
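A rough worked example using the commonly quoted takeaways (about 20 training tokens per parameter, and training compute of roughly 6 * params * tokens FLOPs); the paper's fitted exponents differ slightly:

def chinchilla_optimal(compute_flops):
    # C ~= 6 * N * D and D ~= 20 * N  =>  N ~= sqrt(C / 120)
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(5.76e23)                       # roughly Chinchilla's own budget
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")  # ~70B params, ~1.4T tokens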
LLaMA (2023)
• "Chinchilla-optimal" open-source
LLMs from Meta
59
• RedPajama: open-source recreation of the LLaMA training data
https://arxiv.org/pdf/2302.13971.pdf
https://www.together.xyz/blog/redpajama
Few-shot vs Zero-shot
- e.g. instruction-following
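For concreteness, the difference in prompt shape (the translation example is the one popularized by the GPT-3 paper):

few_shot = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

zero_shot = "Translate English to French: cheese =>"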
Supervised Fine-tuning
https://openai.com/blog/how-should-ai-systems-behave
InstructGPT/GPT-3.5
• Released as text-davinci-002 in the OpenAI API
https://openai.com/blog/instruction-following/
ChatGPT
"Alignment Tax"
ff
OpenAssistant
https://huggingface.co/datasets/OpenAssistant/oasst1
Resources
Questions?
03
Training & Inference
Parallelism
https://openai.com/blog/techniques-for-training-large-neural-networks/
(Figure: tensor parallelism)
https://huggingface.co/blog/bloom-megatron-deepspeed
https://www.deepspeed.ai/training/
Sharded Data-Parallelism
Helpful video:
https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
https://arxiv.org/pdf/2205.01068.pdf
https://openai.com/contributions/gpt-4
Auto-regressive Sampling
• Remember that we sample tokens one at a time
- [It's, a, blue, ...
https://arxiv.org/abs/1904.09751
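A minimal sampling-loop sketch; `model` stands for any function mapping the token IDs so far to next-token logits. The linked paper motivates smarter strategies (nucleus / top-p sampling); this sketch uses plain temperature sampling:

import numpy as np

def generate(model, prompt_ids, n_new, temperature=0.8, seed=0):
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = np.asarray(model(ids)) / temperature     # next-token logits over the vocabulary
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                              # softmax -> probabilities
        ids.append(int(rng.choice(len(probs), p=probs)))  # sample, append, repeat
    return ids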
Runtime Complexity
• Self-attention runs in O(N^2) for sequence length N
Based on "Efficient Transformers: A Survey" by Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
and "Long Range Arena: A Benchmark for Efficient Transformers" by Y Tay, M Dehghani, S Abnar, Y Shen, D Bahri, P Pham, J Rao, L Yang, S Ruder, D Metzler
http://lucasb.eyer.be/transformer
https://github.com/HazyResearch/flash-attention
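Quick numbers on why the quadratic term hurts: the attention-weight matrix alone is seq_len x seq_len per head per layer (fp32 here purely for illustration; FlashAttention avoids materializing it):

for n in (2_048, 8_192, 32_768):
    print(f"seq_len={n}: {n * n * 4 / 1e6:,.0f} MB for one fp32 attention matrix")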
Dealing with Large Model Sizes
• Large subject! Lilian Weng from OpenAI has a thorough post
• (Figure: LLaMA quantization results)
https://arxiv.org/pdf/2208.07339.pdf
https://arxiv.org/pdf/2212.09720.pdf
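One of the main tools is quantization. A toy sketch of the core idea (per-tensor absmax int8); real schemes like LLM.int8() in the first paper above add per-channel scales and outlier handling:

import numpy as np

def quantize_int8(W):
    scale = np.abs(W).max() / 127.0                             # per-tensor absmax scale
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.default_rng(0).normal(size=(4096, 4096)).astype(np.float32)
q, s = quantize_int8(W)                       # 4 bytes/param -> 1 byte/param
print(np.abs(W - dequantize(q, s)).max())     # small rounding error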
Resources
Questions?
Thanks!