
LLM Bootcamp


2023

LLM Foundations
Sergey Karayev

APRIL 21, 2023


LLMBC 2023

Agenda

00 Foundations of ML: speedrun of key ideas in ML
01 Transformer Architecture: core ideas and notable examples
02 Notable LLMs: T5, GPT, Chinchilla, et al.
03 Training & Inference: running a Transformer

2





00

Foundations of
Machine Learning

LLMBC 2023

Traditional Programming vs Machine Learning


[Diagram] Software 1.0 (traditional programming): a person writes the program that maps inputs to outputs. Software 2.0 (machine learning): training data of example inputs and outputs is used to produce the program, which then maps new inputs to outputs.


4
LLMBC 2023

Types of Machine Learning


• Unsupervised Learning: learn the structure of data in order to generate more data
  - e.g. completing "This product does what it is supposed __"

• Supervised Learning: learn how data maps to labels, to recognize or predict
  - e.g. image -> "cat", audio -> "Hey Siri"

• Reinforcement Learning: learn how to act in an environment to obtain reward

5
LLMBC 2023

Converging on just...

Supervised or Self-supervised Learning

• image -> "cat"
• "This product does what it is supposed __" -> "to."
• game state -> next move
• audio -> "Hey Siri"

6
LLMBC 2023

Inputs and outputs are always just numbers


What we see: "Lincoln"

What the machine "sees": the character codes [76, 105, 110, 99, 111, 108, 110]

7
LLMBC 2023

Why is this hard?


"I loved this movie"
• Infinite variety of inputs can all mean the "As good as The Godfather"
same thing "🔥 no cap"

• Meaningful differences can be tiny


• Structure of the world is complex



LLMBC 2023

How is it done?

• Many methods for Machine Learning

- Logistic Regression

- Support Vector Machines

- Decision Trees (xgboost)

• But one is dominant

- Neural Networks (also called Deep Learning)

LLMBC 2023

Inspiration

[Figure: input: see a cat -> the brain -> output: say "cat"]

• Inspired by what we know to be intelligent: the brain

• The brain is composed of billions of neurons

• Each neuron receives electrical inputs and sends an electrical output

• The brain itself has high-level inputs and outputs

https://www.the-scientist.com/the-nutshell/what-made-human-brains-so-big-36663
https://medicalxpress.com/news/2018-07-neuron-axons-spindly-theyre-optimizing.html

10

LLMBC 2023

Formalization

[Figure: the biological neuron, with its inputs and output, is formalized as an artificial neuron in a multilayer perceptron]

https://www.the-scientist.com/the-nutshell/what-made-human-brains-so-big-36663
https://medicalxpress.com/news/2018-07-neuron-axons-spindly-theyre-optimizing.html
https://www.jessicayung.com/explaining-tensorflow-code-for-a-multilayer-perceptron/

11
LLMBC 2023

A "perceptron" is a vector of numbers

https://www.jessicayung.com/explaining-tensorflow-code-for-a-multilayer-perceptron/

12
LLMBC 2023

A "layer" is a matrix of numbers

https://www.jessicayung.com/explaining-tensorflow-code-for-a-multilayer-perceptron/

13
LLMBC 2023

The neural network is a set of matrices


Called "parameters" or "weights"

NN operations are just matrix multiplications.


GPUs are really fast at matrix multiplications.
14
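To make this concrete, here is a minimal sketch (assuming NumPy and made-up layer sizes) showing that a forward pass through two layers is nothing but matrix multiplications plus a simple nonlinearity:

```python
import numpy as np

# Hypothetical sizes: 4 input features, 8 hidden units, 3 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # first layer's parameters (a matrix + bias)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # second layer's parameters

def forward(x):
    h = np.maximum(0, x @ W1 + b1)  # matrix multiply + ReLU nonlinearity
    return h @ W2 + b2              # another matrix multiply gives the outputs

print(forward(rng.normal(size=(2, 4))).shape)   # (2, 3): two inputs in, three numbers out each
```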

LLMBC 2023

Training

• Data X (e.g. images), labels y

• Take a little batch of data x:
  - Use the current model to make a prediction x -> y'
  - Compute loss(y, y')
  - Back-propagate the loss through all the layers of the model, and update the weights

• Repeat until loss stops decreasing

https://www.guru99.com/backpropogation-neural-network.html
https://developers.google.com/machine-learning/testing-debugging/metrics/interpretic 15
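A minimal PyTorch-style sketch of this training loop, with a toy model and random tensors standing in for real images and labels:

```python
import torch
from torch import nn

# Toy stand-ins for real data: 256 "images" of 64 features, 10 classes.
X, y = torch.randn(256, 64), torch.randint(0, 10, (256,))
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                  # repeat until the loss stops decreasing
    idx = torch.randint(0, len(X), (32,))
    xb, yb = X[idx], y[idx]              # take a little batch of data
    y_pred = model(xb)                   # use the current model to predict x -> y'
    loss = loss_fn(y_pred, yb)           # compute loss(y, y')
    opt.zero_grad()
    loss.backward()                      # back-propagate the loss through all layers
    opt.step()                           # update the weights
```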

LLMBC 2023

Dataset Splitting

• Split (X, y) into training (~80%), validation (~10%), and test (~10%) sets

• Validation set is for
  - ensuring that training is not "overfitting"
  - setting hyper-parameters of the model (e.g. number of parameters)

• Test set is for measuring validity of predictions on new data

THIS APPLIES TO YOUR EXPERIMENTATION WITH PROMPTS!

16
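A minimal sketch of such a split, assuming NumPy arrays and the 80/10/10 ratios above (a hypothetical helper, not from the slides):

```python
import numpy as np

def split_dataset(X, y, seed=0):
    """Shuffle, then split into ~80% train, ~10% validation, ~10% test."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
    train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```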


LLMBC 2023

Pre-training: slow training of a large model on A LOT of data.

Fine-tuning: fast training of the pre-trained large model on much less data.
LLMBC 2023

Model Hubs

• People share pre-trained models!

• 🤗 (Hugging Face): the most popular Model Hub
  - 180K models
  - 30K datasets

18

LLMBC 2023

Before ~2020: each task had its own NN architecture

http://lucasb.eyer.be/transformer 19
LLMBC 2023

Now: all is Transformers

20
Transformer cartoon (DALL-E)
01

The Transformer
Architecture

10
LLMBC 2023

Attention is all you need (2017)

• Ground-breaking architecture that set SOTA first on translation, and later on all other NLP tasks

• For simplicity, can just look at one half of it

22

LLMBC 2023

Transformer Decoder Overview


• Task is to complete text
- "It's a blue" -> "sundress"
• Inputs: a sequence of N tokens
- [It's, a, blue]
• Output:
- Probability distribution over the next token
• Inference:
  - Sample the next token from the distribution, append it to the inputs, run through the model again, sample, append, etc. (a sketch follows below)
23
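A sketch of this inference loop, assuming a hypothetical `model` that maps a 1D tensor of token IDs to next-token logits:

```python
import torch

def generate(model, tokens, n_new_tokens):
    """Autoregressive decoding: sample, append, run again."""
    for _ in range(n_new_tokens):
        logits = model(tokens)                    # run the whole sequence through the model
        probs = torch.softmax(logits, dim=-1)     # probability distribution over the next token
        next_token = torch.multinomial(probs, 1)  # sample the next token from the distribution
        tokens = torch.cat([tokens, next_token])  # append it to the inputs and repeat
    return tokens
```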

LLMBC 2023

Inputs
• Inputs need to be vectors of numbers
• Start with original text:
- "It's a blue sundress."
• Turn into a sequence of tokens:
- [<SOS>, It, 's, a, blue, sund, ress, ., <EOS>]
• Turn into vocabulary IDs:
- [0, 1026, 338, 257, 4171, 37437, 601, 13, 1]
• Each ID can be represented by a one-hot vector
- e.g. 3 -> [0, 0, 0, 1, 0, 0, 0, 0, ...]
24
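A small illustration using the open-source tiktoken library's GPT-2 encoding; the exact tokens and IDs depend on the tokenizer, so they will not match the slide's numbers exactly:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2's byte-pair-encoding vocabulary
ids = enc.encode("It's a blue sundress.")    # text -> token IDs
print(ids)                                   # a short list of integers
print([enc.decode([i]) for i in ids])        # token IDs -> token strings
```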

LLMBC 2023

Input Embedding

• One-hot vectors are poor representations of words or tokens
  - e.g. the distance between "cat" and "kitten" is the same as between "cat" and "tractor"

• Solution: learn an embedding matrix!
  - (The simplest NN layer type)

[Figure: one-hot token vectors (V-dimensional) multiplied by a V x E embedding matrix give E-dimensional embedded vectors]

25
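A minimal sketch of an embedding lookup in PyTorch, with hypothetical GPT-2-ish sizes:

```python
import torch
from torch import nn

V, E = 50257, 768                       # hypothetical vocabulary and embedding sizes
embedding = nn.Embedding(V, E)          # a learnable V x E matrix
token_ids = torch.tensor([1026, 338, 257, 4171])
vectors = embedding(token_ids)          # look up one row per token: shape (4, 768)
```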

LLMBC 2023

Attention

• (Ignore "Masked Multi-Head" for now)

• Key insight: for a given token in the output sequence, only one or a few tokens in the input sequence are most important

• Introduced in 2015 for translation tasks

https://lilianweng.github.io/posts/2018-06-24-attention/ 26


LLMBC 2023

Basic self-attention

[Figure: 1. Compute attention weights; 2. Combine attention-weighted inputs]

• Input: a sequence of vectors

• Output: a sequence of vectors, each one a weighted sum of the input sequence
  - each weight is just the dot product between input vectors
  - (made to sum to 1)

http://www.peterbloem.nl/blog/transformers 27
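A minimal sketch of this parameter-free version, assuming PyTorch:

```python
import torch

def basic_self_attention(x):
    """x: (sequence_length, dim). No learned parameters at all."""
    weights = torch.softmax(x @ x.T, dim=-1)  # dot product of every vector with every other, made to sum to 1
    return weights @ x                        # each output is a weighted sum of the inputs

y = basic_self_attention(torch.randn(5, 16))  # 5 vectors in, 5 vectors out
```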



LLMBC 2023

Basic self-attention

• Note that every input vector is used in 3 ways:
  - Query
  - Key
  - Value

http://www.peterbloem.nl/blog/transformers 28

LLMBC 2023

Basic self-attention
• Problem: there's no learning involved!

• Solution: project inputs into query, key, and value roles

• Learning these matrices = learning attention

http://lucasb.eyer.be/transformer
http://www.peterbloem.nl/blog/transformers 29
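A minimal single-head sketch with learned W_q, W_k, W_v projections (it also includes the usual 1/sqrt(d) scaling, which the slide doesn't mention):

```python
import torch
from torch import nn

class SelfAttention(nn.Module):
    """Single-head self-attention with learned query/key/value projections."""
    def __init__(self, dim):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                       # x: (seq_len, dim)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        weights = torch.softmax(q @ k.T / x.shape[-1] ** 0.5, dim=-1)
        return weights @ v                      # attention-weighted combination of the values

y = SelfAttention(16)(torch.randn(5, 16))
```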

LLMBC 2023

Multi-head attention

• We can allow different ways of transforming into queries, keys, and values to be learned

• Simply means learning different sets of W_q, W_k, and W_v matrices simultaneously
  - (Actually implemented as a single matrix, anyway.)

[Figure: 3-headed attention]

http://www.peterbloem.nl/blog/transformers 30

LLMBC 2023

Masking attention

In training: note how you shouldn't see future tokens when predicting.

https://jalammar.github.io/illustrated-
http://www.peterbloem.nl/blog/transformers 31
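A minimal sketch of causal masking, assuming PyTorch: future positions are set to -inf before the softmax, so they receive zero weight:

```python
import torch

def masked_attention_weights(q, k):
    """q, k: (seq_len, dim). Each position can only attend to itself and earlier positions."""
    scores = q @ k.T / k.shape[-1] ** 0.5
    mask = torch.tril(torch.ones(len(q), len(q), dtype=torch.bool))  # lower-triangular causal mask
    scores = scores.masked_fill(~mask, float("-inf"))                # hide future tokens
    return torch.softmax(scores, dim=-1)                             # future positions get weight 0
```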
LLMBC 2023

Masked Multi-Head Attention


• Conceptual view:
  - a token comes in
  - it gets "augmented" with previously-seen tokens that seem relevant (masked self-attention)
  - this happens in several different ways simultaneously (multiple heads)

• NOTE: there's no notion of "position" so far!

32

LLMBC 2023

Positional Encoding
• Attention is totally position-invariant!
  - e.g. [this, movie, is, great] is the same as [movie, this, great, is]

• So, let's add position-encoding vectors to the embedding vectors
  - It really is that simple

33
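A minimal sketch of the learned-position-embedding variant (as used in GPT-2; the original Transformer paper used fixed sinusoidal vectors instead, but either way the vectors are simply added):

```python
import torch
from torch import nn

V, E, max_len = 50257, 768, 1024                  # hypothetical sizes
tok_emb = nn.Embedding(V, E)                      # token embeddings
pos_emb = nn.Embedding(max_len, E)                # one learned vector per position

token_ids = torch.tensor([1026, 338, 257, 4171])
positions = torch.arange(len(token_ids))
x = tok_emb(token_ids) + pos_emb(positions)       # just add them: shape (4, 768)
```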

LLMBC 2023

Add
• "Skip connections" aka "residual blocks"

- output = module(input) + input

- Allows gradient to ow from the loss function all


the way to the rst layer

- (Possible because each module's output is the


same shape as its input)

34
fi
fl

LLMBC 2023

Layer Normalization
• Neural net modules perform best when input vectors have uniform mean and std in each dimension.

• As inputs flow through the network, means and std's get blown out.

• Layer Normalization is a hack to reset things to where we want them in between layers.

output = module(layer_norm(input)) + input

35
https://arxiv.org/pdf/1803.08494.pdf
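A minimal sketch of this pre-norm residual pattern, assuming PyTorch and an arbitrary sub-module:

```python
import torch
from torch import nn

class PreNormResidual(nn.Module):
    """output = module(layer_norm(input)) + input"""
    def __init__(self, dim, module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.module = module

    def forward(self, x):
        return self.module(self.norm(x)) + x   # skip connection around the normalized sub-layer

block = PreNormResidual(16, nn.Linear(16, 16))
y = block(torch.randn(5, 16))                  # same shape in and out, so residuals work
```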

LLMBC 2023

Feed Forward Layer


• Standard Multi-Layer Perceptron with one hidden layer

• Defined as y = W2 · GeLU(W1 · x + b1) + b2

• Conceptual view:
  - a token (augmented with other relevant tokens that it has seen) comes in...
  - ...and "upgrades" its representation

36
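A minimal sketch of this block, assuming PyTorch and the common 4x hidden-dimension expansion (a convention not stated on the slide):

```python
import torch
from torch import nn

class FeedForward(nn.Module):
    """y = W2 · GeLU(W1 · x + b1) + b2, with a 4x hidden expansion."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, 4 * dim)   # W1, b1
        self.fc2 = nn.Linear(4 * dim, dim)   # W2, b2

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

y = FeedForward(768)(torch.randn(5, 768))
```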

LLMBC 2023

Transformer Architecture
https://aizi.substack.com/p/how-does-gpt-3-spend-its-175b-parameters
• The main Transformer Layer is stacked many times

• The overall hyperparameters are:

- Number of layers

- Embedding dimension

- Number of attention heads

• The largest models are ~70% feed-forward weights

37

LLMBC 2023

Why does this work so well?

38
LLMBC 2023

Thinking like Transformers

• Restricted Access Sequence Processing (RASP, 2021): a programming language of Transformer-implementable operations

https://srush.github.io/raspy/ 39
LLMBC 2023

We mostly don't understand it, though


• Much great work from Anthropic if this has captured your curiosity!

40
LLMBC 2023

Should you be able to code a Transformer?


• Definitely not necessary!

• BUT: it's not difficult, it is fun, and is probably worth doing

• Andrej Karpathy's GPT-2 implementation is <400 lines of code, including Attention and MLP blocks

https://www.youtube.com/watch?v=kCc8FmEb1nY
41

LLMBC 2023

Resources

• Lucas Beyer's Lecture on Transformers

• Peter Bloem's "Transformers from Scratch"

• Nelson Elhage's "Transformers for Software Engineers" for a different view

• Andrej Karpathy's entire Neural Networks: Zero to Hero video series

• Lilian Weng's "The Transformer Family v2" megapost

42

LLMBC 2023

Questions?

43
03

Notable LLMs

30
LLMBC 2023

Three Easy Pieces

From Lucas Beyer's lecture Intro to Transformers.


LLMBC 2023

BERT (2019)
• Bidirectional Encoder Representations from Transformers

• Encoder-only (no attention masking)

• 110M params

• 15% of all words masked out

• Was great, now dated

From Lucas Beyer's lecture Intro to Transformers. 46


LLMBC 2023

T5: Text-to-Text Transfer Transformer (2020)


• Input and output are both text strings

• Encoder-Decoder architecture

• 11B parameters

• Still could be a good choice for fine-tuning!

From Lucas Beyer's lecture Intro to Transformers. 47

LLMBC 2023

T5 Training Data
• Unsupervised pre-training on the Colossal Clean Crawled Corpus (C4)
  - Start with Common Crawl (over 50TB of compressed data, 10B+ web pages)
  - Filtered down to ~800GB, or ~160B tokens

• Also trained on academic supervised tasks

https://paperswithcode.com/dataset/c4
https://stanford-cs324.github.io/winter2022/lectures/data/
48

LLMBC 2023

GPT / GPT-2 (2019)


• Generative Pre-trained Transformer

• Decoder-only (uses masked self-attention)

• Largest model is 1.5B

49
From Lucas Beyer's lecture Intro to Transformers.

LLMBC 2023

GPT-2 Training Data

• Found that Common Crawl has major data quality issues

• Formed the WebText dataset
  - scraped all outbound links (45M) from Reddit posts which received at least 3 karma

• After de-duplication and some heuristic filtering, left with 8M documents for a total of 40GB of text

https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf 50
LLMBC 2023

Byte Pair Encoding

• How does GPT tokenize?

• Middle ground between
  - old-school NLP tokenization, where out-of-vocab words would be replaced by a special token
  - UTF-8 bytes

51

LLMBC 2023

GPT-3 (2020)
• Just like GPT-2, but 100x larger (175B params)

• Exhibited unprecedented few-shot and zero-shot learning

52

LLMBC 2023

GPT-3 Training Data

• The dataset mix comes to a total of 500B tokens

• But the model was trained on only 300B of them!

https://arxiv.org/pdf/2005.14165.pdf 53

LLMBC 2023

GPT-4 (2023)

54
55
LLMBC 2023

https://ourworldindata.org/grapher/ai-training-computation
LLMBC 2023

The Bitter Lesson

LLMs

56
LLMBC 2023

But what exactly is the relationship between model size and dataset size?
LLMBC 2023

Chinchilla (2022)
• Empirically derived formulas for the optimal model size and training set size given a fixed compute budget

• Found that most LLMs are "undertrained"

• Trained Chinchilla (70B) vs Gopher (280B) at the same compute budget, by using 4x fewer params and 4x more data

• (Note that this is for one epoch)

https://arxiv.org/pdf/2203.15556.pdf 58

LLMBC 2023

LLaMA (2023)

• "Chinchilla-optimal" open-source
LLMs from Meta

• Several sizes from 7B to 65B,


trained on at least 1T tokens

• Benchmarks competitively against


GPT-3 and other LLMs

• Open-source, but non-commercial

59

LLMBC 2023

LLaMA Training Data


• Custom quality-filtering of CommonCrawl + some C4 + GitHub + Wikipedia + Books + arXiv + Stack Exchange

• RedPajama: open-source recreation

https://arxiv.org/pdf/2302.13971.pdf
https://www.together.xyz/blog/redpajama 60
LLMBC 2023

Including code in training data


• T5 and GPT-3 (2020) specifically removed code. But most recent models are trained on ~5% code. Why?

• Code-specific models such as OpenAI Codex (2021) were built by further training GPT-3 on public GitHub code.

• Empirically, this improved performance on non-code tasks!

• Open-source dataset: The Stack (3TB of permissively licensed source code)

Yao Fu et al. How does GPT Obtain its Ability? 61

LLMBC 2023

And there's another important part of the story: Instruction Tuning
LLMBC 2023

Few-shot vs Zero-shot

• At the time of GPT-3 (2020), the mindset was mostly few-shot
  - e.g. text completion

• By the time of ChatGPT (2022), the mindset was all zero-shot
  - e.g. instruction-following

63

LLMBC 2023

Supervised Fine-tuning

• Very little text in the original GPT-3 dataset is of the zero-shot form.

• To improve performance on zero-shot inputs, fine-tuned on a smaller, high-quality dataset of instruction-completion pairs

• (Sourced from thousands of contractors)

https://openai.com/blog/how-should-ai-systems-behave 64

LLMBC 2023

InstructGPT/GPT-3.5

• Had humans rank different GPT-3 outputs, and used RL to further fine-tune the model

• Much better at following instructions

• Released as text-davinci-002 in the OpenAI API

https://openai.com/blog/instruction-following/ 65
LLMBC 2023

ChatGPT

• Further RLHF on conversations

• ChatML format (messages from system, assistant, user roles)

66
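A sketch of what role-tagged messages look like through the 2023-era OpenAI Python client (ChatML itself is the underlying wire format these messages are serialized into); the model name and prompt here are just placeholders:

```python
import openai  # pre-1.0 openai-python client; expects OPENAI_API_KEY in the environment

# The chat models consume a list of role-tagged messages rather than one flat prompt.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain RLHF in one sentence."},
]
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response["choices"][0]["message"]["content"])  # the assistant-role reply
```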

LLMBC 2023

The GPT Lineage

Yao Fu's How does GPT Obtain its Ability? 67


LLMBC 2023

"Alignment Tax"

• Instruction-tuning increases the model's zero-shot ability, but at a cost
  - Confidence becomes less calibrated
  - Few-shot ability suffers

[Figure: calibration plots for Base GPT-4 vs RLHF GPT-4]

GPT-4 Technical Report - https://arxiv.org/pdf/2303.08774.pdf 68
LLMBC 2023

It's possible to "steal" RLHF

• Got 52K instruction-following demonstrations from text-davinci-003, then fine-tuned LLaMA on them.

69
LLMBC 2023

OpenAssistant

• April 2023 dataset release

• 160K messages across 66K conversation trees, 35 languages, 460K quality ratings, 13.5K volunteers

https://huggingface.co/datasets/OpenAssistant/oasst1 70

LLMBC 2023

And one last idea


LLMBC 2023

Retrieval-enhanced Transformer (2021)


• Instead of both learning language and memorizing facts in the model's params, why not just learn language in params, and retrieve facts from a large database?

• BERT-encode sentences, store them in a large DB (>1T tokens)

• Then, fetch matching sentences and attend to them

• Doesn't work as well as large LLMs. Yet.

https://arxiv.org/pdf/2112.04426.pdf (Dec 2021)

72



LLMBC 2023

Resources

• Lilian Weng's "The Transformer Family v2" megapost

• Xavier Amatriain's Transformer Models Catalog

• Yao Fu's How does GPT Obtain its Ability?

73

LLMBC 2023

Questions?

74
04

Training & Inference

50
LLMBC 2023

Problems with training LLMs

• Massive amounts of data

• Massive models don't fit on a single GPU, or even a single multi-GPU machine

• Long training runs are painful

76


LLMBC 2023

Parallelism

• Data parallelism: spread a single batch of data across GPUs

• Model parallelism: spread the model's layers across GPUs

• Tensor parallelism: spread a single matrix op across GPUs

https://openai.com/blog/techniques-for-training-large-neural-networks/ 79

LLMBC 2023

BLOOM (GPT-3 sized LM)

Had to use multiple tricks ("3D parallelism") from two great libraries: DeepSpeed and Megatron-LM

https://huggingface.co/blog/bloom-megatron-deepspeed
https://www.deepspeed.ai/training/ 80

LLMBC 2023

Sharded Data-Parallelism

Literally pass around model params between GPUs as computation is proceeding!

Helpful video:
https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/ 81

LLMBC 2023

A glimpse into training hell

• Dozens of manual restarts, 70+ automatic restarts due to HW failures

• Manual restarting from checkpoints when the loss would diverge

• Switching optimizers and software versions in the middle of training

https://arxiv.org/pdf/2205.01068.pdf 83

LLMBC 2023

"Training run babysitting"

https://openai.com/contributions/gpt-4 84
LLMBC 2023

Considerations for LLM inference

• Understanding auto-regressive sampling

• Improving (or not) runtime complexity

• Dealing with large model size

85

LLMBC 2023

Auto-regressive Sampling
• Remember that we sample tokens one at a time
  - [It's, a, blue, ...

• The softmax outputs a peaky probability distribution over possible next tokens

• A temperature parameter controls how peaky it is
  - t=0 will always sample the most likely next token
  - t=1 will often sample less-likely ones

• Human text is not all high-probability next words!

https://arxiv.org/abs/1904.09751 86
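A minimal sketch of temperature sampling over next-token logits, assuming PyTorch:

```python
import torch

def sample_next_token(logits, temperature=1.0):
    """Sample one token ID from next-token logits.
    temperature = 0 is greedy argmax; temperature = 1 keeps the model's own distribution."""
    if temperature == 0:
        return int(torch.argmax(logits))                      # always the most likely token
    probs = torch.softmax(logits / temperature, dim=-1)       # lower temperature = peakier distribution
    return int(torch.multinomial(probs, 1))
```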

LLMBC 2023

Runtime Complexity
• Self-attention runs in O(N^2) for sequence length N

• Many O(N) approximations have been developed

• But none have provided a strict improvement

• Recently, FlashAttention sped things up via smart GPU programming

Based on "Efficient Transformers: A Survey" by Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
and "Long Range Arena: A Benchmark for Efficient Transformers" by Y Tay, M Dehghani, S Abnar, Y Shen, D Bahri, P Pham, J Rao, L Yang, S Ruder, D Metzler
87
http://lucasb.eyer.be/transformer https://github.com/HazyResearch/flash-attention
LLMBC 2023

Dealing with Large Model Sizes

• Large subject! Lilian Weng from OpenAI has a thorough post

• Quantization is most relevant to us
  - LLM weights are usually in float32 or float16
  - Recent work (LLM.int8) has shown that 8-bit post-quantization is basically fine
  - Even 4-bit seems fine!

[Figure: LLaMA quantization in llama.cpp] https://github.com/ggerganov/llama.cpp

https://arxiv.org/pdf/2208.07339.pdf
https://arxiv.org/pdf/2212.09720.pdf 88
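A minimal sketch of naive absmax int8 post-training quantization of a single weight matrix; this only illustrates the basic idea, not the actual LLM.int8() or llama.cpp schemes:

```python
import torch

def quantize_int8(w):
    """Naive absmax quantization: map floats into the int8 range with one scale per tensor."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(768, 768)
q, scale = quantize_int8(w)                     # 4x smaller than float32 storage
print((w - dequantize(q, scale)).abs().max())   # reconstruction error stays small
```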

LLMBC 2023

Resources

• Megatron-LM (GitHub): probably still the best insights into training LLMs at scale

• OpenAI post Techniques for Training Large Neural Networks

• Lilian Weng's "Large Transformer Model Inference Optimization"

89

LLMBC 2023

Questions?

90
LLMBC 2023

Thanks!

@sergeykarayev
@full_stack_dl

[Image prompt: /imagine green tropical parrot eating stack of pancakes, flapjack breakfast, hyper-realistic portrait, DSLR Canon R5, chromatic aberration, accent lighting, super resolution, hyper-detailed, cinematic, OpenGL - Shaders]
