Introduction To LLMs: Transformers, Types of LLMs, Configuration Settings

LARGE LANGUAGE MODELS (LLMs)
Large neural networks trained at internet scale to estimate the probability of sequences of words (transformers with billions of parameters).
Ex: GPT, FLAN-T5, LLaMA, PaLM, BLOOM
Abilities (and the computing resources needed) tend to rise with the number of parameters.

USE CASES
– Standard NLP tasks (classification, summarization, etc.)
– Content generation
– Reasoning (Q&A, planning, coding, etc.)

In-context learning: Specifying the task to perform directly in the prompt.

TRANSFORMER COMPONENTS
Token: Word or sub-word; the basic unit processed by transformers.
Encoder: Processes the input sequence to generate a vector representation (or embedding) for each token.
Decoder: Processes input tokens to produce new tokens.
Embedding layer: Maps each token to a trainable vector.
Positional encoding vector: Added to the token embedding vector to keep track of the token's position.
Self-Attention: Computes the importance of each word in the input sequence to all other words in the sequence.

TYPES OF LLMs
Decoder only = Autoregressive model (Ex: GPT, BLOOM)
PRE-TRAINING OBJECTIVE: Predict the next token based on the previous sequence of tokens (= Causal Language Modeling)
OUTPUT: Next token
USE CASES: Text generation

Encoder-Decoder = Seq-to-seq model (Ex: T5, BART)
PRE-TRAINING OBJECTIVE: Varies from model to model (e.g., span corruption for T5)
OUTPUT: Sentinel token + predicted tokens
USE CASES: Translation, Q&A, summarization

CONFIGURATION SETTINGS
Random sampling: The model chooses an output word at random, using the probability distribution to weigh the selection (could be too creative).
TECHNIQUES TO CONTROL RANDOM SAMPLING
– Top K: The next token is drawn from the k tokens with the highest probabilities.
– Top P: The next token is drawn from the tokens with the highest probabilities whose combined probability exceeds p.
Temperature: Influences the shape of the probability distribution through a scaling factor in the softmax layer.
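A minimal sketch of how these settings combine, assuming raw logits from the model's final layer (numpy only; the function and argument names are illustrative, not taken from any specific library):

import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature rescales the logits before the softmax:
    # <1 sharpens the distribution, >1 flattens it (more "creative").
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]           # tokens sorted by decreasing probability
    if top_k is not None:
        order = order[:top_k]                 # keep only the k most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        order = order[:cutoff]                # smallest set whose combined probability exceeds p

    kept = probs[order] / probs[order].sum()  # renormalize over the kept tokens
    return np.random.choice(order, p=kept)    # random sampling weighted by probability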
Fine-Tuning: Task-Specific Fine-Tuning, Multi-Task Fine-Tuning, Model Evaluation

PEFT (PARAMETER-EFFICIENT FINE-TUNING)
Full fine-tuning of LLMs is challenging, so PEFT methods keep the majority of the original LLM weights frozen and train only a small number of additional parameters.
– LoRA: h = W0·x + AB·x, where W0 stays frozen and only the small low-rank matrices A and B are trained (see the sketch below).
– Prompt tuning: Add trainable tensors to the model input embeddings, commonly known as "soft prompts," optimized directly through gradient descent.
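A minimal numpy sketch of the LoRA idea behind h = W0·x + AB·x (the dimensions, rank r, and initialization below are illustrative assumptions):

import numpy as np

d_out, d_in, r = 512, 512, 8           # r << d: the low-rank bottleneck

W0 = np.random.randn(d_out, d_in)      # frozen pre-trained weight matrix (never updated)
A = np.zeros((d_out, r))               # trainable; starts at zero so the model initially matches W0
B = np.random.randn(r, d_in) * 0.01    # trainable low-rank factor

x = np.random.randn(d_in)
h = W0 @ x + A @ (B @ x)               # LoRA forward pass: only A and B receive gradients

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out
print(r * (d_in + d_out), "trainable vs", d_in * d_out, "frozen")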
LARGE LANGUAGE MODEL CHOICE

Generative AI project lifecycle: Use case definition & scoping → Model selection → Adapt (prompt engineering, fine-tuning), augment, and evaluate the model → App integration (model optimization, deployment).

Two options for model selection:
• Use a pre-trained LLM.
• Train your own LLM from scratch.
But, in general… develop your application using a pre-trained LLM, except if you work with extremely specific data (e.g., medical, legal).

Hubs: Where you can browse existing models.
Model Cards: List the best use cases, training details, and limitations of models.
The model choice will depend on the details of the task to carry out.

Model pre-training: Model weights are adjusted in order to minimize the loss of the training objective. It requires significant computational resources (i.e., GPUs, due to the high computational load).
Typical parameter counts: PaLM 540B, GPT-3 175B, YaLM 100B, GPT-2 1.5B, BERT 110M.
COMPUTATIONAL CHALLENGES

LLMs are massive and require plenty of memory for training and inference.
To load the model into GPU RAM: 1 parameter (32-bit precision) = 4 bytes needed, so 1B parameters = 4 × 10^9 bytes = 4 GB of GPU RAM.
Pre-training requires storing additional components beyond the model's parameters:
• Optimizer states (e.g., 2 for Adam)
• Gradients
• Forward activations
• Temporary variables
This could result in an additional 12-20 bytes of memory needed per model parameter, which means 16 GB to 24 GB of GPU memory to train a 1-billion-parameter LLM, around 4-6x the GPU RAM needed just for storing the model weights.
Hence, the memory needed for LLM training is excessive for consumer hardware and even demanding for data center hardware (for single-processor training). For instance, an NVIDIA A100 supports up to 80 GB of RAM.
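The same arithmetic as a tiny sketch; the 12-20 bytes of training overhead per parameter is the rule of thumb quoted above, not an exact figure:

def gpu_memory_gb(n_params, bytes_per_param=4, training_overhead=(12, 20)):
    # Weights only (e.g., FP32 = 4 bytes per parameter)
    inference = n_params * bytes_per_param / 1e9
    # Training also stores optimizer states, gradients, activations, temporary variables
    low = n_params * (bytes_per_param + training_overhead[0]) / 1e9
    high = n_params * (bytes_per_param + training_overhead[1]) / 1e9
    return inference, (low, high)

print(gpu_memory_gb(1e9))   # ~4 GB to load, ~16-24 GB to train a 1B-parameter model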
QUANTIZATION

Quantization reduces the memory needed to store the weights of the model by converting their precision from 32-bit floats to 16-bit floats or 8-bit integers. The FP32 space spans magnitudes from about 3 × 10^-38 up to about 3 × 10^38; lower-precision formats include FP16, BFLOAT16, INT8, and INT4.
Quantization maps the FP32 numbers to a lower-precision space by employing scaling factors determined from the range of the FP32 numbers. In most cases, quantization strongly reduces memory requirements with a limited loss in prediction quality.

BFLOAT16 is a popular alternative to FP16:
• Developed by Google Brain
• Balances memory efficiency and accuracy
• Wider dynamic range
• Optimized for storage and speed in ML tasks
e.g., FLAN-T5 was pre-trained using BFLOAT16.

Benefits of quantization: less memory, potentially better model performance, higher calculation speed.
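A minimal sketch of symmetric INT8 quantization with a single scaling factor derived from the FP32 range (real schemes work per tensor or per channel and handle outliers more carefully; the names are illustrative):

import numpy as np

def quantize_int8(weights_fp32):
    # Scaling factor determined from the range of the FP32 numbers
    scale = np.max(np.abs(weights_fp32)) / 127.0
    q = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale        # approximate reconstruction

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small error, 4x less memory than FP32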
SCALING LAWS & COMPUTE-OPTIMAL MODELS

Researchers explored trade-offs between the dataset size, the model size, and the compute budget. The compute budget is the constraint; the scaling choices are the dataset size (# of tokens) and the model size (# of parameters), and both drive model performance.
Increasing compute may seem ideal for better performance, but practical constraints like hardware, time, and budget limit its feasibility.

It has been empirically shown that, as the compute budget remains fixed:
• Fixed model size: Increasing the training dataset size improves model performance.
• Fixed dataset size: Larger models demonstrate lower test loss, indicating enhanced performance.

What's the optimal balance? Once scaling laws have been estimated, we can use the Chinchilla approach, i.e., choose the dataset size and the model size to train a compute-optimal model, which maximizes performance for a given compute budget. The compute-optimal training dataset size is ~20x the number of parameters.
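A back-of-the-envelope sketch of the ~20x heuristic quoted above (the exact ratio comes from empirical fits in the Chinchilla paper):

def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # Compute-optimal training set size is roughly 20x the parameter count
    return n_params * tokens_per_param

for n in (1e9, 70e9, 175e9):
    print(f"{n/1e9:.0f}B params -> ~{chinchilla_optimal_tokens(n)/1e9:.0f}B training tokens")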
RLHF PRINCIPLES
• Reinforcement Learning from Human Feedback (RLHF): Preference data is used to train a reward model that mimics human annotator preferences and then scores LLM completions for reinforcement learning adjustments.
• Preference Optimization (DPO, IPO): Minimize a training loss directly on the preference data.

COLLECTING HUMAN FEEDBACK
Assign 1 to the preferred response and 0 to the rejected one in each pair, and place the preferred option first by reordering the completions.

REWARD MODEL
The reward model assesses the alignment of LLM outputs with human preferences. The reward values obtained are then used to update the LLM weights and train a new, human-aligned version, with the specifics determined by the optimization algorithm.
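A minimal sketch of the pairwise loss commonly used to train such a reward model on (preferred, rejected) completion pairs; the scores are assumed to come from the reward model, and the function name is illustrative:

import numpy as np

def reward_model_loss(r_preferred, r_rejected):
    # The reward model should score the human-preferred completion higher:
    # minimize -log(sigmoid(r_preferred - r_rejected)) over the preference pairs.
    margin = np.asarray(r_preferred) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))

# Scores assigned by the reward model to a batch of (preferred, rejected) pairs
print(reward_model_loss([2.1, 0.3], [0.5, -0.2]))  # lower loss when preferred > rejected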
PPO ALGORITHM FOR LLMs
The PPO objective is used to update the LLM weights over N iterations:
1: Text generation. The updated LLM completes a set of prompts (e.g., "The movie was..." → "...an absolute thrill").
2: Scoring. The estimated future total reward and the actual reward from the reward model.
3: Model weights update with reinforcement learning.
Policy loss: Maximize it to get higher rewards while meeting the criteria for helpfulness.
Entropy loss: Maximize it to promote and sustain model creativity; the higher the entropy, the more creative the policy.
Reinforcement learning algorithm: Proximal Policy Optimization (PPO) is a popular choice.
Updated model: The resulting updated model should be more aligned with human preferences.

REWARD HACKING
The policy can learn to exploit the reward model, producing completions that maximize the reward score without genuinely matching human preferences.
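A compact, simplified sketch of the two losses mentioned above (a per-sample view of PPO's clipped surrogate objective; advantage estimation and the full RL loop are omitted):

import numpy as np

def ppo_policy_objective(logp_new, logp_old, advantages, eps=0.2):
    # Ratio between the updated policy and the policy that generated the completions
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Maximized during training: higher reward without straying too far from the old policy
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))

def entropy_bonus(token_probs):
    # Higher entropy -> more varied generations, sustaining model creativity
    return float(-np.mean(np.sum(token_probs * np.log(token_probs + 1e-9), axis=-1)))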
RL FROM AI FEEDBACK
Constitutional AI (Bai, Yuntao, et al., 2022): Harmful prompts → critique and revise the responses based on constitutional principles → fine-tune a pre-trained LLM on the revised completions → reinforcement learning.
Result: A policy trained by Reinforcement Learning from AI Feedback (RLAIF).

DIRECT PREFERENCE OPTIMIZATION
Fine-tune a pre-trained LLM with DPO (or IPO) directly on comparison (preference) data to obtain a fine-tuned LLM, without training a separate reward model.
Identity Preference Optimization (IPO) is a variant of DPO that is less prone to overfitting.
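A minimal sketch of the DPO loss on a single preference pair, assuming sequence-level log-probabilities from the model being tuned and from a frozen reference model (beta is the usual temperature-like hyperparameter):

import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards: how much the tuned model favors each completion vs. the reference
    chosen = beta * (logp_chosen - ref_logp_chosen)
    rejected = beta * (logp_rejected - ref_logp_rejected)
    # Minimize -log(sigmoid(chosen - rejected)): push the preferred completion up
    return float(np.log1p(np.exp(-(chosen - rejected))))

print(dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1))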
LLM-Integrated Applications, Chain-of-Thought Prompting, LLM Reasoning With Program-Aided Language & ReAct

LLM-POWERED APPLICATIONS
• Knowledge can be out of date.
• LLMs struggle with certain tasks (e.g., math).
• LLMs can confidently provide wrong answers ("hallucination").
The LLM should serve as a reasoning engine and leverage external apps or data sources.

LLM-INTEGRATED APPLICATION
An orchestrator sits between the frontend (user) and the LLM, external data sources, and external applications (APIs, Python, etc.). The LLM must:
1. Plan actions: a set of instructions, e.g., Step 1: Get customer ID, Step 2: Reset password.
2. Format outputs: requires formatting for applications to understand the actions.
3. Validate actions: collect information that allows validation of an action.
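A small sketch of points 2 and 3: asking the LLM for a structured action that the orchestrator can parse and validate before executing anything. The JSON schema and the allowed actions below are hypothetical examples, not part of any specific framework:

import json

ALLOWED_ACTIONS = {"get_customer_id", "reset_password"}   # hypothetical action list

def validate_action(llm_output: str) -> dict:
    # 2. Format outputs: the LLM is instructed to answer with JSON the application understands
    action = json.loads(llm_output)
    # 3. Validate actions: check the action and the information it needs before executing it
    if action.get("name") not in ALLOWED_ACTIONS:
        raise ValueError(f"Unknown action: {action.get('name')}")
    if action["name"] == "reset_password" and "customer_id" not in action.get("args", {}):
        raise ValueError("reset_password requires a customer_id")
    return action

print(validate_action('{"name": "reset_password", "args": {"customer_id": "C-42"}}'))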
RETRIEVAL AUGMENTED GENERATION (RAG)
An AI framework that integrates external data sources and apps with the LLM (e.g., documents, private databases, etc.). Multiple implementations exist; the right one will depend on the details of the task and the data format.
A retriever (query encoder + external knowledge source) sits between the user and the LLM:
• We retrieve the documents most similar to the input query in the external data.
• We combine the retrieved documents with the input query and send the prompt to the LLM to receive the answer.
! The size of the context window can be a limitation. Use multiple chunks (e.g., with LangChain).
! The data must be in a format that allows its relevance to be assessed at inference time. Use embedding vectors (a vector store).
Vector database: Stores vectors and associated metadata, enabling efficient nearest-neighbor vector search.
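A minimal sketch of the two RAG steps above, with a toy in-memory vector store; embed() is a stand-in for a real embedding model, and the documents are made up:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

documents = ["Refund policy: refunds within 30 days.", "Shipping takes 3-5 business days."]
index = np.stack([embed(d) for d in documents])        # toy vector store

def retrieve(query: str, k: int = 1):
    scores = index @ embed(query)                      # nearest neighbors by cosine similarity
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def rag_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))               # 1. retrieve the most similar documents
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"  # 2. combine with the query

print(rag_prompt("How long do refunds take?"))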
MODEL OPTIMIZATION FOR DEPLOYMENT
Inference challenges: high computing and storage demands. The goal is to shrink the model size while maintaining performance.

Model Distillation
• Scale down model complexity while preserving accuracy.
• Train a small student model to mimic a large, frozen teacher model.
• Soft labels: The teacher's completions serve as ground-truth labels for the knowledge-distillation (soft) loss; the hard labels from the labeled training data feed the student (hard) loss.
• The student and distillation losses update the student model weights via backpropagation (see the sketch after this section).
• The student LLM can then be used for inference.

Post-Training Quantization (PTQ)
• PTQ reduces model weight precision to 16-bit float or 8-bit integer.
• It can target both weights and activation layers for impact.
• It may sacrifice some performance, yet is beneficial for cost savings and performance gains.

Model Pruning
• Removes redundant model parameters that contribute little to the model performance.
• Some methods require full model training, while others fall into the PEFT category (e.g., LoRA).
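A minimal sketch of the distillation and student losses described above, for a single token position (logits and the temperature T are illustrative; a real setup averages over a batch and mixes the two losses with a weight):

import numpy as np

def softmax(z, T=1.0):
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_losses(teacher_logits, student_logits, hard_label, T=2.0):
    # Soft (distillation) loss: match the teacher's softened distribution
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_loss = -np.sum(p_teacher * np.log(p_student + 1e-9))
    # Hard (student) loss: standard cross-entropy against the labeled training data
    hard_loss = -np.log(softmax(student_logits)[hard_label] + 1e-9)
    return soft_loss, hard_loss   # combined (e.g., weighted sum) to update the student only

t = np.array([3.0, 1.0, 0.2]); s = np.array([2.0, 1.5, 0.1])
print(distillation_losses(t, s, hard_label=0))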
CHAIN-OF-THOUGHT PROMPTING
Complex reasoning is challenging for LLMs, e.g., problems with multiple steps or mathematical reasoning. The prompt and the completion are important!

Chain-of-Thought (CoT)
• Prompts the model to break down problems into sequential steps.
• Operates by integrating intermediate reasoning steps into the examples used for one- or few-shot inference.

Prompt:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

Completion:
A: The cafeteria had 23 apples. They used 20 to make lunch. 23 - 20 = 3. They bought 6 more apples, so 3 + 6 = 9. The answer is 9.
In the completion, the whole prompt is included.

CoT improves performance but struggles with precision-demanding tasks like tax computation or discount application.
Solution: Allow the LLM to communicate with a program that is proficient at math, such as a Python interpreter.

PROGRAM-AIDED LANGUAGE (PAL)
The LLM generates a script and passes it to an interpreter: the completion is handed off to a Python interpreter, so the calculations are accurate and reliable.

Prompt (the CoT reasoning is written as code):
Q: Roger has 5 tennis balls. [...]
A:
# Roger started with 5 tennis balls
tennis_balls = 5
# 2 cans of 3 tennis balls each is 6 tennis balls
bought_balls = 2 * 3
# The answer is
answer = tennis_balls + bought_balls
Q: [...]

PAL execution: the generated code for the new question is run by the interpreter to obtain the answer.
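A minimal sketch of the PAL hand-off: the code block produced by the LLM is executed by the Python interpreter and the value of answer is read back. (Running exec on untrusted model output is unsafe; a real system would sandbox it.)

generated_code = """
# Roger started with 5 tennis balls
tennis_balls = 5
# 2 cans of 3 tennis balls each is 6 tennis balls
bought_balls = 2 * 3
# The answer is
answer = tennis_balls + bought_balls
"""

namespace = {}
exec(generated_code, namespace)   # hand the completion off to the Python interpreter
print(namespace["answer"])        # 11: the calculation is exact, not "guessed" by the LLM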
REACT
A prompting strategy that combines CoT reasoning and action planning, employing structured examples to guide an LLM in problem-solving and decision-making for solutions.
• Instructions: Define the task, what a thought is, and the available actions.
• Question: The question to be answered.
• Thought: Analysis of the current situation and the next steps to take.
• Action: The actions are taken from a predetermined list defined in the set of instructions in the prompt.
• Observation: The result of the previous action.
The Thought → Action → Observation loop repeats until the action is finish[], and ReAct reduces the risks of errors.

LangChain can be used to connect multiple components through agents, tools, etc.
Agents: Interpret the user input and determine which tool to use for the task (LangChain includes agents for PAL & ReAct).
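A skeletal sketch of the Thought → Action → Observation loop; llm and tools are placeholders for a model call and a dictionary of tool functions, and the regex assumes the action format defined in the instructions:

import re

def react_loop(question, llm, tools, max_steps=5):
    # llm: callable taking the transcript and returning the next "Thought/Action" step
    # tools: dict mapping action names (the predetermined list) to callables
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                        # model emits "Thought: ...\nAction: tool[input]"
        transcript += step + "\n"
        match = re.search(r"Action: (\w+)\[(.*)\]", step)
        if not match:
            break
        action, arg = match.group(1), match.group(2)
        if action == "finish":                        # the loop ends when the action is finish[answer]
            return arg
        observation = tools[action](arg)              # run the chosen tool
        transcript += f"Observation: {observation}\n"
    return None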