Lecture 15 - Foundation Models - CLIP and GPT

The document discusses foundation models in machine learning, focusing on their training and application methods, including zero-shot and fine-tuning approaches. It highlights notable models such as GPT and CLIP, detailing their architectures, training data, and performance on various tasks. Key takeaways include the effectiveness of pre-training and transfer learning in developing robust models capable of handling diverse tasks without extensive manual labeling.

Foundation Models
Applied Machine Learning
Derek Hoiem

[Title slide image: DALL-E]
Last class: Transformer Models

• Transformers are efficient, multi-modal data processors
This lecture
• Foundation models: models trained with vast amounts of data and compute on a broad task, often intended as a starting point for specialized models

• Key questions for foundation models are
– How to train them (what architecture, what data, what objective)
– How to apply them (see the sketch after this list), e.g.
• Zero-shot: apply to new tasks without any training examples for those specific tasks
• Linear probe: train a linear model on the frozen features
• Fine-tune: adjust the entire network to perform better on the target task

• We previously saw two examples of foundation models suitable for fine-tuning
– ImageNet-pretrained models for vision
– BERT for language

• We will now learn about two more famous models
– GPT: Generative Pre-Training models for language
– CLIP: Contrastive Language-Image Pretraining for vision
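To make "linear probe" vs. "fine-tune" concrete, here is a minimal PyTorch sketch. The ImageNet-pretrained ResNet-50 backbone, the 10-class target task, and the optimizer settings are illustrative choices, not something specified in the lecture:

```python
import torch
import torch.nn as nn
from torchvision import models

# An ImageNet-pretrained backbone standing in for a "foundation" model.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
feature_dim = backbone.fc.in_features        # 2048 for ResNet-50
backbone.fc = nn.Identity()                  # expose features instead of ImageNet logits
num_classes = 10                             # hypothetical target task

# Linear probe: freeze the backbone and train only a linear head on its features.
for p in backbone.parameters():
    p.requires_grad = False
probe_head = nn.Linear(feature_dim, num_classes)
probe_optimizer = torch.optim.Adam(probe_head.parameters(), lr=1e-3)

# Fine-tune: unfreeze everything and train the whole network, usually with a small LR.
for p in backbone.parameters():
    p.requires_grad = True
finetune_model = nn.Sequential(backbone, nn.Linear(feature_dim, num_classes))
finetune_optimizer = torch.optim.SGD(finetune_model.parameters(), lr=1e-4, momentum=0.9)
```

Zero-shot use, by contrast, needs a model such as CLIP whose outputs can be matched to new labels without any task-specific training (sketched later in this lecture).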
GPT-1 – Improving Language Understanding by Generative Pre-Training (Radford et al. 2018)
GPT-1 (2018)
• Precursor to BERT (2019) that we discussed last class

• Similar architecture and training procedures
– 117M parameters in GPT-1 vs. 340M for BERT-Large

• Pre-training: maximize the data likelihood as a product of conditional probabilities (the objective is written out below), trained on the BooksCorpus
– Predict each token based on the k tokens (the "context") that came before

• Fine-tuned for each task while also retaining the generative objective; some tasks need their inputs processed in a special way

• Achieved state of the art on 9 out of 12 tasks
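For reference, the generative pre-training objective is the standard autoregressive language-modeling likelihood; in the notation of the GPT-1 paper, for an unlabeled token sequence $\mathcal{U} = (u_1, \ldots, u_n)$, context window $k$, and parameters $\Theta$:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$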


GPT-2 (Radford et al. 2019) - Language Models are
Unsupervised Multitask Learners
Aims to create a general purpose language learner
“Current systems are better characterized as narrow experts rather than competent generalists. We would like to move
towards more general systems which can perform many tasks – eventually without the need to manually create and label a
training dataset for each one.

The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct
behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent
and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But
the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017),
and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the
shortcomings of this approach.

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack
of generalization observed in current systems. Progress towards robust systems with current architectures is likely
to require training and measuring performance on a wide range of domains and tasks.”

GPT-2
• A general system should learn to model P(output | input, task)

• The task can be specified in natural language, so language tasks can be framed as sequence-to-sequence text processing (see the prompt sketch after this list)

• Sequence-to-sequence: a problem formulated as receiving input in some modality and producing output in some modality (instead of, e.g., predicting probabilities over the labels of one specific task)
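For example (these prompt strings are illustrative, not taken from the paper), specifying the task in the input text turns every task into text-in, text-out for a single next-token predictor:

```python
# Hypothetical prompts: the "task" is specified in natural language inside the
# input text, so one generative model can handle translation, summarization,
# question answering, etc., purely as text continuation.
prompts = [
    "Translate English to French: The cat sat on the mat. =>",
    "Article: The committee met for three hours and agreed on a new budget. TL;DR:",
    "Question: What is the capital of Illinois? Answer:",
]
# A language model then simply continues each string, one predicted token at a time.
```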
GPT-2: Data and Training
• WebText dataset: created a new web scrape of pages linked from Reddit with at least 3 karma, as these should be of reasonable quality
– Does not require additional manual annotation
– Yields 8 million documents (40 GB of text) from before 2018 after de-duplication and cleaning
– Removed Wikipedia, since it is commonly used in test sets

• GPT-2 is generatively trained on WebText and not fine-tuned on anything else
GPT-2 Architecture and Model Sizes
• Architecture is basically the same as GPT-1 and BERT

[Table of GPT-2 model sizes, with the GPT-1 size and the BERT size indicated for comparison]
GPT-2: Zero-shot results

Perplexity (PPL) is 2^(cross-entropy in bits per token); lower is better.
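To make the definition concrete, here is a tiny sketch with made-up per-token probabilities (not real model outputs):

```python
import math

# Probabilities a language model assigns to the correct next token at each step
# (hypothetical values, chosen so the numbers come out round).
token_probs = [0.25, 0.5, 0.125, 0.25]

# Cross-entropy in bits per token, then perplexity = 2^entropy.
entropy_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
perplexity = 2 ** entropy_bits
print(entropy_bits)  # 2.0 bits per token
print(perplexity)    # 4.0
```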

• Achieves state of the art on many tasks without tuning for them

• Performs much worse than the state of the art on summarization and translation (though it can effectively translate word for word)

See many more examples in the paper.
Continued log-linear improvement with model size

Conclusion: “The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.”
In the OpenAI board room…
GPT-3 (Brown et al. 2020)
Models and Architectures
Training data
Training compute

List price of compute to train GPT-3 175B: ~$4.5M
Few-shot “In-Context Learning”

[Figure: accuracy on a simple task (removing random symbols from a word) for GPT-3, with the GPT-2 scale marked for reference]
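For concreteness, a hypothetical few-shot prompt for this kind of task might look as follows (the examples are made up); the task demonstrations sit directly in the input text, and the model is simply asked to continue:

```python
# Hypothetical few-shot "in-context learning" prompt: task examples are placed
# directly in the input, and the model continues the pattern at inference time.
# No gradient updates are performed.
prompt = """Remove the random symbols from the word.
Input: s.u!c?c.e.s.s.i.o.n  Output: succession
Input: c#o*m@p!u$t%e^r      Output: computer
Input: f%o&u*n#d@a(t)i:o;n  Output:"""
# A sufficiently large model is expected to continue with " foundation".
```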
Performance of GPT-3

• Average performance of few-shot is about the same


as fine-tuned BERT-Large, but varies by task
• Per-task specialized SOTA models are still best
GPT-3 Arithmetic
What to learn from the GPT series
• GPT: generative pre-training (GPT) is effective for large language models

• GPT-2: GPT models can achieve reasonable zero-shot task performance when larger models are trained on more data

• GPT-3: even larger GPT models trained on even more data are good at many tasks, especially text generation, and can be “trained” at inference time with in-context examples
What should my students learn from the GPT papers?
ChatGPT says:

The GPT (Generative Pre-trained Transformer) papers are a series of influential papers that introduced and improved upon the architecture of
the transformer model for natural language processing tasks. Here are some of the key concepts and takeaways that your students can learn
from these papers:

1. Transformers: The GPT papers introduce the transformer model, which is a type of neural network architecture that uses self-attention
mechanisms to process sequences of data. Transformers have become a fundamental building block for many state-of-the-art natural
language processing models.

2. Pre-training: The GPT papers demonstrate the effectiveness of pre-training transformer models on large amounts of text data before fine-
tuning them on specific downstream tasks. Pre-training allows the model to learn general language features that can be transferred to
multiple tasks.

3. Language modeling: The GPT papers use language modeling as the pre-training task, where the model learns to predict the next word in a
sequence given the previous words. Language modeling is a fundamental task in natural language processing and is used in many downstream
applications.

4. Transfer learning: The GPT papers demonstrate the power of transfer learning, where a pre-trained model can be fine-tuned on a specific
downstream task with relatively little data. This allows the model to quickly adapt to new tasks and achieve state-of-the-art performance.

5. Hyperparameter tuning: The GPT papers provide insights into the importance of hyperparameter tuning, such as the learning rate, batch
size, and model size, for achieving the best performance on specific tasks.

6. Overall, the GPT papers are a valuable resource for understanding the transformer model and its applications in natural language
processing. Your students can learn about the importance of pre-training, transfer learning, and hyperparameter tuning, as well as gain
insights into the latest state-of-the-art techniques for language modeling and other natural language processing tasks.

Wrong statements in red, good points in green


On the other hand,
There once was a class so great
Applied Machine Learning, first-rate
The students all learned
And their skills were discerned
Now their models can predict with high rate!
– Chat GPT
[two minute break]
How much of our thoughts and conversation are just next
word prediction?
CLIP: Learning Transferable Visual Models from Natural Language Supervision (Radford et al. 2021)

First key idea: use a text encoder as a classifier
• This is an old idea – “words and pictures” work goes back to ~2000, but at a much smaller scale

• How to scale?
– Learn from natural language supervision (not tags or class labels)
– Scrape 400 million image/text pairs
– “Bag of words” language representation
– Contrastive objective, instead of predicting exact language
– Use transformer architecture
Second key idea(s): contrastively match gestalt text to image
• Use a small transformer language model (76M parameters for the base size)

• Matching task with a large batch (size = 32,768)
– Each image and text in the batch is encoded
– Similarity scores are obtained for all 32K x 32K image-text pairings
– Loss is cross-entropy on matching each image to its text, and each text to its image (a loss sketch follows)

A contrastive task formulation is a good general way to learn when the exact target is unpredictable.
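A minimal sketch of this symmetric loss, assuming (N, D) feature batches already produced by the two encoders; the fixed temperature below is a simplification (CLIP learns the temperature as a parameter), and the variable names are mine:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric cross-entropy over an N x N matrix of image-text similarities.

    image_feats, text_feats: (N, D) tensors where row i of each tensor
    comes from the same image-text pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (N, N) similarity matrix; entry (i, j) compares image i with text j.
    logits = image_feats @ text_feats.t() / temperature

    # The correct match for image i is text i (and vice versa).
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_img = F.cross_entropy(logits, targets)      # match each image to its text
    loss_txt = F.cross_entropy(logits.t(), targets)  # match each text to its image
    return (loss_img + loss_txt) / 2

# Example with random features standing in for encoder outputs (batch of 8, dim 512).
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```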
Training cost
• “The largest ResNet model RN50x64, took 18 days to train on
592 V100 GPUs, while the largest Vision Transformer took 12
days on 256 V100 GPUs”
– ~$91K for Transformer model; $300K for ResNet model
Key idea 3: zero-shot classification
Every training batch is like a novel classification task, matching 32K classes to 32K images.

To create a new classification task (see the sketch after this list):
1. Convert class labels into captions and encode the text
2. Encode the image
3. Assign the image to the label whose caption matches best
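A minimal sketch of these three steps, assuming generic encode_image and encode_text functions that stand in for the pretrained CLIP encoders (the caption template follows the spirit of the paper's "a photo of a {label}" prompt; the function names are mine):

```python
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Zero-shot classification with a CLIP-like model.

    encode_image / encode_text are assumed to return (N, D) feature tensors;
    they stand in for the pretrained CLIP image and text encoders.
    """
    # 1. Convert class labels into captions and encode the text.
    captions = [f"a photo of a {name}" for name in class_names]
    text_feats = F.normalize(encode_text(captions), dim=-1)   # (C, D)

    # 2. Encode the image.
    image_feats = F.normalize(encode_image([image]), dim=-1)  # (1, D)

    # 3. Assign the image to the label whose caption matches best.
    sims = image_feats @ text_feats.t()                       # (1, C)
    return class_names[sims.argmax(dim=-1).item()]
```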
Four ways to adapt CLIP to a new task
1. Zero-shot: convert labels to text and use text-image similarity

2. Linear probe: freeze the image encoder and train a linear layer on its features

3. Nearest neighbor (not in the paper): record features of training examples and use a K-NN classifier (a sketch of options 2 and 3 follows this list)

4. Fine-tune the CLIP encoder for the new task (but then it completely loses its generality)
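As a rough illustration of options 2 and 3, assuming image features have already been extracted with the frozen CLIP image encoder into NumPy arrays (the placeholder data, variable names, and scikit-learn classifiers are my choices, not from the lecture; the CLIP paper's linear probes use logistic regression on frozen features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Assumed precomputed CLIP image features (N x D) and integer labels (N,).
train_feats = np.random.randn(200, 512).astype(np.float32)  # placeholder data
train_labels = np.random.randint(0, 10, size=200)
test_feats = np.random.randn(20, 512).astype(np.float32)

# Option 2: linear probe -- the encoder stays frozen; only this linear model is trained.
probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
probe_preds = probe.predict(test_feats)

# Option 3: K-nearest-neighbor classifier on the same frozen features (not in the paper).
knn = KNeighborsClassifier(n_neighbors=5).fit(train_feats, train_labels)
knn_preds = knn.predict(test_feats)
```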
Zero-shot prediction examples (randomly selected)

• Zero-shot CLIP performs as well as a strong baseline trained on 16 examples per class
• A linear probe on CLIP features needs 4 examples per class to reach zero-shot performance (on average)
What to remember
• Deep learning applications often involve starting with a pre-trained “foundation” model and fine-tuning it

• GPT demonstrates that learning to predict the next word produces a flexible zero-shot and few-shot general language task performer

• CLIP shows that learning to match images to text produces a good zero-shot classifier and an excellent image encoder
Coming up
• Thursday: exam
– Can come to lecture at 9:30 to ask me questions (other than “what is on the exam”)
• Next week: spring break!
• After that: Creating ML applications, and impact of AI/ML
