Lecture 15 - Foundation Models - CLIP and GPT

The document discusses foundation models in machine learning, focusing on their training and application methods, including zero-shot and fine-tuning approaches. It highlights notable models such as GPT and CLIP, detailing their architectures, training data, and performance on various tasks. Key takeaways include the effectiveness of pre-training and transfer learning in developing robust models capable of handling diverse tasks without extensive manual labeling.

Foundation Models
Applied Machine Learning
Derek Hoiem

[Title slide image: DALL-E]
Last class: Transformer Models

• Transformers are efficient, multi-modal data processors
This lecture
• Foundation models: models trained with vast amounts of data and compute on a broad task, often intended as a starting point for specialized models

• Key questions for foundation models are
– How to train them (what architecture, what data, what objective)
– How to apply them (see the sketch after this list), e.g.
• Zero-shot: apply to new tasks without any training examples for those specific tasks
• Linear probe: train a linear model on the frozen features
• Fine-tune: adjust the entire network to perform better on the target task

• We previously saw two examples of foundation models suitable for fine-tuning
– ImageNet-pretrained models for vision
– BERT for language

• We will now learn about two more famous models
– GPT: Generative Pre-Training models for language
– CLIP: Contrastive Language-Image Pretraining for vision
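To make "linear probe" vs. "fine-tune" concrete, here is a minimal PyTorch sketch. The ImageNet-pretrained ResNet-50 backbone, the 10-class target task, and the optimizer settings are illustrative choices, not something specified in the lecture:

```python
import torch
import torch.nn as nn
from torchvision import models

# An ImageNet-pretrained backbone standing in for a "foundation" model.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
feature_dim = backbone.fc.in_features        # 2048 for ResNet-50
backbone.fc = nn.Identity()                  # expose features instead of ImageNet logits
num_classes = 10                             # hypothetical target task

# Linear probe: freeze the backbone and train only a linear head on its features.
for p in backbone.parameters():
    p.requires_grad = False
probe_head = nn.Linear(feature_dim, num_classes)
probe_optimizer = torch.optim.Adam(probe_head.parameters(), lr=1e-3)

# Fine-tune: unfreeze everything and train the whole network, usually with a small LR.
for p in backbone.parameters():
    p.requires_grad = True
finetune_model = nn.Sequential(backbone, nn.Linear(feature_dim, num_classes))
finetune_optimizer = torch.optim.SGD(finetune_model.parameters(), lr=1e-4, momentum=0.9)
```

Zero-shot use, by contrast, needs a model such as CLIP whose outputs can be matched to new labels without any task-specific training (sketched later in this lecture).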
GPT-1 – Improving Language Understanding by Generative Pre-Training (Radford et al. 2018)
GPT-1 (2018)
• Precursor to BERT (2019) that we discussed last class

• Similar architecture and training procedures
– 117M parameters in GPT-1 vs. 340M for BERT-Large

• Pre-training: maximize the data likelihood as a product of conditional probabilities (the objective is written out below), trained on the BooksCorpus
– Predict each token based on the k tokens (the "context") that came before

• Fine-tuned for each task while also retaining the generative objective; some tasks need their inputs processed in a special way

• Achieved state of the art on 9 out of 12 tasks
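For reference, the generative pre-training objective is the standard autoregressive language-modeling likelihood; in the notation of the GPT-1 paper, for an unlabeled token sequence $\mathcal{U} = (u_1, \ldots, u_n)$, context window $k$, and parameters $\Theta$:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$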


GPT-2 (Radford et al. 2019) - Language Models are
Unsupervised Multitask Learners
Aims to create a general purpose language learner
“Current systems are better characterized as narrow experts rather than competent generalists. We would like to move
towards more general systems which can perform many tasks – eventually without the need to manually create and label a
training dataset for each one.

The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct
behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent
and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But
the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017),
and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the
shortcomings of this approach.

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack
of generalization observed in current systems. Progress towards robust systems with current architectures is likely
to require training and measuring performance on a wide range of domains and tasks.”

GPT-2
• A general system should learn to model P(output | input, task)

• The task can be specified in natural language, so language tasks can be framed as sequence-to-sequence text processing (see the prompt sketch after this list)

• Sequence-to-sequence: a problem formulated as receiving input in some modality and producing output in some modality (instead of, e.g., predicting probabilities over the labels of one specific task)
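For example (these prompt strings are illustrative, not taken from the paper), specifying the task in the input text turns every task into text-in, text-out for a single next-token predictor:

```python
# Hypothetical prompts: the "task" is specified in natural language inside the
# input text, so one generative model can handle translation, summarization,
# question answering, etc., purely as text continuation.
prompts = [
    "Translate English to French: The cat sat on the mat. =>",
    "Article: The committee met for three hours and agreed on a new budget. TL;DR:",
    "Question: What is the capital of Illinois? Answer:",
]
# A language model then simply continues each string, one predicted token at a time.
```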
GPT-2: Data and Training
• WebText dataset: created a new web scrape of pages linked from Reddit with at least 3 karma, as these should be of reasonable quality
– Does not require additional manual annotation
– Yields 8 million documents (40 GB of text) from before 2018 after de-duplication and cleaning
– Removed Wikipedia, since it is commonly used in test sets

• GPT-2 is generatively trained on WebText and not fine-tuned on anything else
GPT-2 Architecture and Model Sizes
• Architecture is basically the same as GPT-1 and BERT

[Table of GPT-2 model sizes, with the GPT-1 size and the BERT size indicated for comparison]
GPT-2: Zero-shot results

Perplexity (PPL) is 2^(cross-entropy in bits per token); lower is better.
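To make the definition concrete, here is a tiny sketch with made-up per-token probabilities (not real model outputs):

```python
import math

# Probabilities a language model assigns to the correct next token at each step
# (hypothetical values, chosen so the numbers come out round).
token_probs = [0.25, 0.5, 0.125, 0.25]

# Cross-entropy in bits per token, then perplexity = 2^entropy.
entropy_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
perplexity = 2 ** entropy_bits
print(entropy_bits)  # 2.0 bits per token
print(perplexity)    # 4.0
```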

• Achieves state of the art on many tasks without tuning for them

• Performs much worse than the state of the art on summarization and translation (though it can effectively translate word for word)

See many more examples in the paper.
Continued log-linear improvement with model size

Conclusion: “The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.”
In the OpenAI board room…
GPT-3 (Brown et al. 2020)
Models and Architectures
Training data
Training compute

List price of compute to train GPT-3 175B: ~$4.5M
Few-shot “In-Context Learning”

[Figure: accuracy on a simple task (removing random symbols from a word) for GPT-3, with the GPT-2 scale marked for reference]
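For concreteness, a hypothetical few-shot prompt for this kind of task might look as follows (the examples are made up); the task demonstrations sit directly in the input text, and the model is simply asked to continue:

```python
# Hypothetical few-shot "in-context learning" prompt: task examples are placed
# directly in the input, and the model continues the pattern at inference time.
# No gradient updates are performed.
prompt = """Remove the random symbols from the word.
Input: s.u!c?c.e.s.s.i.o.n  Output: succession
Input: c#o*m@p!u$t%e^r      Output: computer
Input: f%o&u*n#d@a(t)i:o;n  Output:"""
# A sufficiently large model is expected to continue with " foundation".
```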
Performance of GPT-3

• Average performance of few-shot is about the same


as fine-tuned BERT-Large, but varies by task
• Per-task specialized SOTA models are still best
GPT-3 Arithmetic
What to learn from the GPT series
• GPT: generative pre-training (GPT) is effective for large language models

• GPT-2: GPT models can achieve reasonable zero-shot task performance when larger models are trained on more data

• GPT-3: even larger GPT models trained on even more data are good at many tasks, especially text generation, and can be “trained” at inference time with in-context examples
What should my students learn from the GPT papers?
ChatGPT says:

The GPT (Generative Pre-trained Transformer) papers are a series of influential papers that introduced and improved upon the architecture of
the transformer model for natural language processing tasks. Here are some of the key concepts and takeaways that your students can learn
from these papers:

1. Transformers: The GPT papers introduce the transformer model, which is a type of neural network architecture that uses self-attention
mechanisms to process sequences of data. Transformers have become a fundamental building block for many state-of-the-art natural
language processing models.

2. Pre-training: The GPT papers demonstrate the effectiveness of pre-training transformer models on large amounts of text data before fine-
tuning them on specific downstream tasks. Pre-training allows the model to learn general language features that can be transferred to
multiple tasks.

3. Language modeling: The GPT papers use language modeling as the pre-training task, where the model learns to predict the next word in a
sequence given the previous words. Language modeling is a fundamental task in natural language processing and is used in many downstream
applications.

4. Transfer learning: The GPT papers demonstrate the power of transfer learning, where a pre-trained model can be fine-tuned on a specific
downstream task with relatively little data. This allows the model to quickly adapt to new tasks and achieve state-of-the-art performance.

5. Hyperparameter tuning: The GPT papers provide insights into the importance of hyperparameter tuning, such as the learning rate, batch
size, and model size, for achieving the best performance on specific tasks.

6. Overall, the GPT papers are a valuable resource for understanding the transformer model and its applications in natural language
processing. Your students can learn about the importance of pre-training, transfer learning, and hyperparameter tuning, as well as gain
insights into the latest state-of-the-art techniques for language modeling and other natural language processing tasks.

Wrong statements in red, good points in green


On the other hand,
There once was a class so great
Applied Machine Learning, first-rate
The students all learned
And their skills were discerned
Now their models can predict with high rate!
– Chat GPT
[two minute break]
How much of our thoughts and conversation are just next
word prediction?
CLIP: Learning Transferable Visual Models from Natural Language Supervision (Radford et al. 2021)

First key idea: use a text encoder as a classifier
• This is an old idea – “words and pictures” work goes back to ~2000, but at a much smaller scale

• How to scale?
– Learn from natural language supervision (not tags or class labels)
– Scrape 400 million image/text pairs
– “Bag of words” language representation
– Contrastive objective, instead of predicting exact language
– Use transformer architecture
Second key idea(s): contrastively match gestalt text to image
• Use a small transformer language model (76M parameters for the base size)

• Matching task with a large batch (size = 32,768)
– Each image and text in the batch is encoded
– Similarity scores are obtained for all 32K x 32K image-text pairings
– Loss is cross-entropy on matching each image to its text, and each text to its image (a loss sketch follows)

A contrastive task formulation is a good general way to learn when the exact target is unpredictable.
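A minimal sketch of this symmetric loss, assuming (N, D) feature batches already produced by the two encoders; the fixed temperature below is a simplification (CLIP learns the temperature as a parameter), and the variable names are mine:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric cross-entropy over an N x N matrix of image-text similarities.

    image_feats, text_feats: (N, D) tensors where row i of each tensor
    comes from the same image-text pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (N, N) similarity matrix; entry (i, j) compares image i with text j.
    logits = image_feats @ text_feats.t() / temperature

    # The correct match for image i is text i (and vice versa).
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_img = F.cross_entropy(logits, targets)      # match each image to its text
    loss_txt = F.cross_entropy(logits.t(), targets)  # match each text to its image
    return (loss_img + loss_txt) / 2

# Example with random features standing in for encoder outputs (batch of 8, dim 512).
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```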
Training cost
• “The largest ResNet model RN50x64, took 18 days to train on
592 V100 GPUs, while the largest Vision Transformer took 12
days on 256 V100 GPUs”
– ~$91K for Transformer model; $300K for ResNet model
Key idea 3: zero-shot classification
Every training batch is like a novel classification task, matching 32K classes to 32K images.

To create a new classification task (see the sketch after this list):
1. Convert class labels into captions and encode the text
2. Encode the image
3. Assign the image to the label whose caption matches best
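A minimal sketch of these three steps, assuming generic encode_image and encode_text functions that stand in for the pretrained CLIP encoders (the caption template follows the spirit of the paper's "a photo of a {label}" prompt; the function names are mine):

```python
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Zero-shot classification with a CLIP-like model.

    encode_image / encode_text are assumed to return (N, D) feature tensors;
    they stand in for the pretrained CLIP image and text encoders.
    """
    # 1. Convert class labels into captions and encode the text.
    captions = [f"a photo of a {name}" for name in class_names]
    text_feats = F.normalize(encode_text(captions), dim=-1)   # (C, D)

    # 2. Encode the image.
    image_feats = F.normalize(encode_image([image]), dim=-1)  # (1, D)

    # 3. Assign the image to the label whose caption matches best.
    sims = image_feats @ text_feats.t()                       # (1, C)
    return class_names[sims.argmax(dim=-1).item()]
```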
Four ways to adapt CLIP to a new task
1. Zero-shot: convert labels to text and use text-image similarity

2. Linear probe: freeze the image encoder and train a linear layer on its features

3. Nearest neighbor (not in the paper): record features of training examples and use a K-NN classifier (a sketch of options 2 and 3 follows this list)

4. Fine-tune the CLIP encoder for the new task (but then it completely loses its generality)
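As a rough illustration of options 2 and 3, assuming image features have already been extracted with the frozen CLIP image encoder into NumPy arrays (the placeholder data, variable names, and scikit-learn classifiers are my choices, not from the lecture; the CLIP paper's linear probes use logistic regression on frozen features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Assumed precomputed CLIP image features (N x D) and integer labels (N,).
train_feats = np.random.randn(200, 512).astype(np.float32)  # placeholder data
train_labels = np.random.randint(0, 10, size=200)
test_feats = np.random.randn(20, 512).astype(np.float32)

# Option 2: linear probe -- the encoder stays frozen; only this linear model is trained.
probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
probe_preds = probe.predict(test_feats)

# Option 3: K-nearest-neighbor classifier on the same frozen features (not in the paper).
knn = KNeighborsClassifier(n_neighbors=5).fit(train_feats, train_labels)
knn_preds = knn.predict(test_feats)
```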
Zero-shot prediction examples (randomly selected)

• Zero-shot CLIP performs as well as a strong baseline trained on 16 examples per class
• A linear probe on CLIP features needs 4 examples per class to reach zero-shot performance (on average)
What to remember
• Deep learning applications often involve starting with a pre-trained “foundation” model and fine-tuning it

• GPT demonstrates that learning to predict the next word produces a flexible zero-shot and few-shot general language task performer

• CLIP shows that learning to match images to text produces a good zero-shot classifier and an excellent image encoder
Coming up
• Thursday: exam
– Can come to lecture at 9:30 to ask me questions (other than “what is on the exam”)
• Next week: spring break!
• After that: Creating ML applications, and impact of AI/ML
