Lecture 15 - Foundation Models - CLIP and GPT Models
Applied Machine Learning
Derek Hoiem
Dall-E
Last class: Transformer Models
“The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks.”
(Radford et al., 2019, “Language Models are Unsupervised Multitask Learners”)
GPT-2
• A general system should learn to model P(output | input, task)
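As a concrete illustration of this idea, the sketch below conditions a small public language model on a task description written in plain text, so one model can serve many tasks. It assumes the Hugging Face transformers package and the small "gpt2" checkpoint; the prompt format is illustrative, not the exact GPT-2 recipe.

# Minimal sketch: one language model, many tasks, with the task given in natural language.
# Assumes the Hugging Face transformers package and the small public "gpt2" checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def run_task(task: str, text: str) -> str:
    # P(output | input, task) is approximated by conditioning the LM on a prompt
    # that states the task and the input in plain text.
    prompt = f"{task}:\n{text}\n"
    out = generator(prompt, max_new_tokens=40, do_sample=False)
    return out[0]["generated_text"][len(prompt):]

print(run_task("Translate English to French", "Where is the library?"))
print(run_task("Summarize", "GPT-2 is trained on WebText to predict the next token."))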
[Figure annotations: GPT-1 size, BERT size]
GPT-2: Zero-shot results
List price of compute to train GPT-3 175B: ~$4.5M
Few-shot “In-Context Learning”
[Plot: accuracy on a simple task of removing random symbols from a word, comparing GPT-3 with models at GPT-2 scale]
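To make few-shot in-context learning concrete, here is a minimal sketch of a prompt for the symbol-removal task from the plot above. The example words and formatting are made up for illustration; the key point is that no weights are updated, and the "learning" comes entirely from conditioning on a few worked examples.

# Hedged sketch: few-shot "in-context learning" for the symbol-removal task.
# No gradient updates; the task is specified entirely through the prompt.
examples = [
    ("s.u!c/c!e.s s i/o/n", "succession"),
    ("c;o!m,p:u-t.e*r", "computer"),
    ("l@a#n$g%u^a&g*e", "language"),
]
query = "f)o(u*n&d^a%t$i#o@n"

prompt = "Remove the random symbols from each word.\n"
for noisy, clean in examples:
    prompt += f"Input: {noisy}\nOutput: {clean}\n"
prompt += f"Input: {query}\nOutput:"

# A sufficiently large language model would be expected to continue with "foundation".
print(prompt)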
GPT-3
Performance of GPT-3
The GPT (Generative Pre-trained Transformer) papers are a series of influential papers that introduced and improved upon the architecture of the transformer model for natural language processing tasks. Here are some of the key concepts and takeaways that your students can learn from these papers:
1. Transformers: The GPT papers introduce the transformer model, which is a type of neural network architecture that uses self-attention mechanisms to process sequences of data. Transformers have become a fundamental building block for many state-of-the-art natural language processing models.
2. Pre-training: The GPT papers demonstrate the effectiveness of pre-training transformer models on large amounts of text data before fine-tuning them on specific downstream tasks. Pre-training allows the model to learn general language features that can be transferred to multiple tasks.
3. Language modeling: The GPT papers use language modeling as the pre-training task, where the model learns to predict the next word in a sequence given the previous words. Language modeling is a fundamental task in natural language processing and is used in many downstream applications.
4. Transfer learning: The GPT papers demonstrate the power of transfer learning, where a pre-trained model can be fine-tuned on a specific downstream task with relatively little data. This allows the model to quickly adapt to new tasks and achieve state-of-the-art performance.
5. Hyperparameter tuning: The GPT papers provide insights into the importance of hyperparameter tuning, such as the learning rate, batch size, and model size, for achieving the best performance on specific tasks.
Overall, the GPT papers are a valuable resource for understanding the transformer model and its applications in natural language processing. Your students can learn about the importance of pre-training, transfer learning, and hyperparameter tuning, as well as gain insights into the latest state-of-the-art techniques for language modeling and other natural language processing tasks.
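The language-modeling objective mentioned in point 3 can be written down in a few lines. The sketch below assumes PyTorch and uses a random embedding plus linear head as a stand-in for the transformer; the point is the shifted next-token cross-entropy loss, not the architecture.

# Minimal sketch of the next-token (language modeling) objective, assuming PyTorch.
import torch
import torch.nn.functional as F

vocab_size, seq_len, d_model = 100, 8, 32
tokens = torch.randint(0, vocab_size, (1, seq_len))        # toy token ids

# Stand-in for a transformer: any model that maps tokens -> per-position logits.
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)
logits = lm_head(embed(tokens))                             # (1, seq_len, vocab_size)

# Each position predicts the *next* token, so shift inputs and targets by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),                 # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),                              # targets are tokens 1..T-1
)
print(loss.item())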
• How to scale?
– Learn from natural language supervision (not tags or class labels)
– Scrape 400 million image/text pairs
– “Bag of words” language representation
– Contrastive objective, instead of predicting exact language (see the sketch after this list)
– Use transformer architecture
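A minimal sketch of the contrastive objective from the list above, assuming PyTorch: encode a batch of paired images and captions, normalize the embeddings, and apply a symmetric cross-entropy so that each image is most similar to its own caption. The linear encoders and dummy features are stand-ins; CLIP itself uses a ResNet/ViT image encoder and a transformer text encoder.

# Hedged sketch of CLIP-style contrastive training, assuming PyTorch.
import torch
import torch.nn.functional as F

batch, d_img, d_txt, d_emb = 8, 512, 256, 128
images = torch.randn(batch, d_img)      # stand-in for image features
texts = torch.randn(batch, d_txt)       # stand-in for caption features (paired row-wise)

image_encoder = torch.nn.Linear(d_img, d_emb)
text_encoder = torch.nn.Linear(d_txt, d_emb)
temperature = 0.07

img_emb = F.normalize(image_encoder(images), dim=-1)   # unit-norm embeddings
txt_emb = F.normalize(text_encoder(texts), dim=-1)

# Similarity of every image with every caption in the batch.
logits = img_emb @ txt_emb.t() / temperature            # (batch, batch)
targets = torch.arange(batch)                           # matching pairs lie on the diagonal

# Symmetric cross-entropy: image->text and text->image.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())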
Second key idea(s): contrastively match gestalt text to image
To create a new classification task (see the code sketch after these steps):
1. Convert class labels into captions and encode the text
2. Encode the image
3. Assign the image to the label whose caption matches best
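These three steps map directly onto an off-the-shelf CLIP checkpoint. The sketch below assumes OpenAI's clip package with the "ViT-B/32" model and a local file example.jpg; both are placeholders, and other CLIP wrappers expose equivalent encode_image / encode_text calls.

# Hedged sketch of zero-shot classification with CLIP, assuming OpenAI's `clip` package.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 1. Convert class labels into captions and encode the text.
labels = ["dog", "cat", "car"]
captions = [f"a photo of a {label}" for label in labels]
text_tokens = clip.tokenize(captions).to(device)

# 2. Encode the image (example.jpg is a placeholder path).
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    txt_emb = torch.nn.functional.normalize(model.encode_text(text_tokens), dim=-1)
    img_emb = torch.nn.functional.normalize(model.encode_image(image), dim=-1)

# 3. Assign the image to the label whose caption matches best.
similarity = (img_emb @ txt_emb.t()).squeeze(0)
print(labels[similarity.argmax().item()])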
Four ways to adapt CLIP to a new task
1. Zero-shot: convert labels to text and use text-image similarity