CS224u Intro 2023 Handout
Christopher Potts
CS224u: Natural Language Understanding
Our team
• Kawin Ethayarajh: Evaluation in NLP
• Sidd Karamcheti: Robot learning and NLP, scaling
• Mina Lee: Human–AI Interaction, HCI, LLMs
• Siyan Li: Human-centered NLP; distinguished course alum
• Lisa Li: Diffusion models, prefix tuning, in-context learning
• Tolúlope Ògúnremí: Multilingual and low-resource NLP
• Tianyi Zhang: LLMs, emergence
[Figure slides: the state of the field in 2012 vs. 2022]
A golden age for NLU
Which U.S. states border no U.S. states?
[Timeline figure: question-answering systems from 1980, 2009, 2020, 2021, and 2022]

Example query: Which country bordering the Mediterranean borders a country that is bordered by a country whose population exceeds the population of India? (System answer: turkey.)
OpenAI GPT-3

Spotting models' "cheap tricks" (Levesque 2013)
Benchmarks saturate faster than ever
AI model development past and present
[Diagram: task parameters in past and present model development]
The Transformer
Self-supervision
1. The model's only objective is to learn co-occurrence patterns in the sequences it is trained on.
2. Alternatively: to assign high probability to attested sequences (sketched in the code below).
3. Generation then involves sampling from the model.
4. The sequences can contain anything.
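Below is a minimal runnable sketch of this objective (assuming PyTorch; the toy two-layer model and the random stand-in for an "attested sequence" are purely illustrative, not anything from the course):

```python
# Minimal sketch of the self-supervised objective: assign high probability
# to attested sequences, i.e., minimize next-token cross-entropy.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),  # scores for the next token
)

seq = torch.randint(0, vocab_size, (1, 16))  # stand-in for an attested sequence
logits = model(seq[:, :-1])                  # predict token t+1 from token t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), seq[:, 1:].reshape(-1))
loss.backward()  # co-occurrence patterns in the data are the only signal

# Generation then involves sampling from the model:
probs = torch.softmax(logits[0, -1].detach(), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```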
Large-scale pretraining
[Chart: model size over time, log scale from ~100M to ~1T parameters. Labeled models include GPT and BERT (~100M), GPT-2 (~1B), Megatron (8.3B and 11B), FLAN T5 XXL (Google; 11B), Megatron-Turing NLG, and PaLM. Developers range from academic groups and start-ups to loose collectives.]
The GPT-3 paper and the rise of in-context learning
Pure self-supervision vs. regular supervision

Standard supervision:
• "My palms started to sweat as the lotto numbers were read off." → nervous anticipation = 1
• "I took a deep breath as the curtain started to rise on my debut night." → nervous anticipation = 1

Few-shot in-context learning:
• Hey model, here is an example of nervous anticipation: "My palms started to sweat as the lotto numbers were read off."
• Hey model, here's an example without nervous anticipation: "..."
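As a concrete illustration, here is a small sketch of how a few-shot prompt like the one above might be assembled as a plain string (the wording and the build_prompt helper are hypothetical, not the course's official format):

```python
# Hypothetical sketch: assembling a few-shot in-context learning prompt.
demonstrations = [
    ("an example of nervous anticipation",
     "My palms started to sweat as the lotto numbers were read off."),
    ("an example without nervous anticipation",
     "..."),  # elided on the slide
]

def build_prompt(demos, test_text):
    """Turn (description, text) pairs plus a test item into one prompt string."""
    lines = [f'Hey model, here is {desc}: "{text}"' for desc, text in demos]
    lines.append(f'Is this nervous anticipation? "{test_text}"')
    return "\n".join(lines)

print(build_prompt(
    demonstrations,
    "I took a deep breath as the curtain started to rise on my debut night."))
```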
Learning from human feedback
Old-school prompting style (so 2021)
High-level overview

Topics:
1. Contextual representations
2. Multi-domain sentiment analysis
3. Retrieval-augmented in-context learning
4. Compositional generalization
5. Benchmarking and adversarial training and testing
6. Model introspection
7. Methods and metrics

Work:
1. 3 assignment/bakeoff combos
2. 3 offline quizzes
3. Final project:
   a. Lit review
   b. Experiment protocol
   c. Final paper
Background materials
• CS224n is a prerequisite for this course, so we are going to skip a lot of the fundamentals we have covered in past years.
• If you need a refresher, check out the background page of the course site:
  • Fundamentals of scientific computing in AI
  • Static vector representations
  • Supervised learning
Core goals
• Hands-on experience with a wide range of challenging NLU problems.
• A mentor from the teaching team will guide you through the project assignments – there are many examples of these projects becoming important publications.
• Central goal: to make you the best – most insightful and responsible – NLU researcher and practitioner wherever you go next.
Course theme
Transformer-based pretraining
Progression and exploration for Transformers
1. Core concepts and goals
2. Architectures
3. Positional encoding
4. Distillation
5. Diffusion objectives [Lisa!]
6. Practical pretraining and fine-tuning [Sidd!]
Course theme
LLMs for everything vs. retrieval-augmented
What do we need?
• Synthesis/Fluency 🤠
• Efficiency
• Updateability
• Provenance/Factualness
• Safety/Security
Efficiency
LLMs for everything: 😬 / Retrieval-augmented: 🤠

Smaller LMs are:
• Cheaper to develop
• Cheaper to maintain
• Cheaper to deploy

Managing large search indices is a familiar problem.
Updateability
LLMs for everything: 😬 / Retrieval-augmented: 🤠
Safety and security
Retrieval-augmented: 🤠

Access restrictions are imposed at the document level in a familiar way.
What do we need? (LLMs for everything / Retrieval-augmented)
• Synthesis/Fluency: 🤠 / 🤠
• Efficiency: 😬 / 🤠
• Updateability: 😬 / 🤠
• Provenance/Factualness: 😬 / 🤠
• Safety/Security: 😬 / 🤠
The present: Wrangling pretrained components
[Diagram: the recent past vs. the present, with task parameters]
Models can communicate in natural language
Few-shot OpenQA

Example prompt (test question: "What is the course to take?"):
p1: Pragmatics is the study of language use.
q1: What is pragmatics?
a1: The study of language use
p2: Bert is a Muppet who lives with Ernie.
q2: Who is Bert?
a2: Bert is a Muppet

Strategies for choosing the demonstrations D (sampled from train) and the passages pi:
• D = random
• D = IR.kNN(Q)
• pi = IR.retrieve(qi, k=1)
• Hindsight: pi = IR.retrieve(qi + ai, k=1)
• P = IR.retrieve(qi, k=5); pi = argmax_{p ∈ P} LM(ai | qi, p)
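As a runnable illustration of the pi = IR.retrieve(qi, k=1) strategy, here is a toy sketch; the retrieve function is a hypothetical keyword-overlap stand-in for a real IR system (e.g., BM25 or a dense index), and the prompt format is illustrative only:

```python
# Toy sketch of retrieval-augmented few-shot OpenQA.
corpus = [
    "Pragmatics is the study of language use.",
    "Bert is a Muppet who lives with Ernie.",
]

def retrieve(query, k=1):
    """Rank passages by crude word overlap with the query (toy IR stand-in)."""
    overlap = lambda p: len(set(p.lower().split()) & set(query.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def openqa_prompt(train_qas, test_q):
    """Build a few-shot prompt with one retrieved passage per demonstration."""
    parts = []
    for q, a in train_qas:
        parts.append(f"p: {retrieve(q, k=1)[0]}\nq: {q}\na: {a}")
    parts.append(f"p: {retrieve(test_q, k=1)[0]}\nq: {test_q}\na:")
    return "\n\n".join(parts)

print(openqa_prompt(
    [("What is pragmatics?", "The study of language use")],
    "Who is Bert?"))
```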
Compositional generalization
The COGS challenge (Wu et al. 2023)

ReCOGS (Wu et al. 2023)

ReCOGS remains challenging (Wu et al. 2023)
Course theme
Water and air of our field
Jacques Cousteau: "Water and air, the two essential fluids on which all life depends, have become global garbage cans."
We ask a lot of our datasets
1. Optimize models
2. Evaluate models
3. Compare models
4. Enable new capabilities in models
5. Measure fieldwide progress
6. Support scientific inquiry
What does benchmark saturation really mean?
Course theme
Strathern's Law:
When a measure becomes a target, it ceases to be a good measure.
What we seem to value
Selected "Values encoded in ML research" from Birhane et al. 2021:
• Performance
• Efficiency
• Interpretability (for researchers)
• Applicability in the real world
• Robustness
• Scalability
• Interpretability (for users)
• Beneficence
• Privacy
• Fairness
• Justice
Towards multidimensional leaderboards
Dynascoring
Course theme
Trust
[Diagram: reliability and bias; analytic guarantees about model behaviors]
Standards for explanation

Faithful:
We can give mechanistic, mathematical explanations of how models work that are perfectly faithful and accurate. However, these explanations fail to illuminate the concepts we care about.

Human interpretable:
We can give human interpretable explanations that are not true to how our models actually work. These can seem satisfying, but if we can't guarantee that they are faithful to how the models actually work, we are simply confusing ourselves.
[Timeline figure: 1980–2023]
Core course components
• Quizzes: 15%
• Homeworks and bakeoffs: 35%
• Literature review: 10%
• Experiment protocol: 10%
• Final project paper: 30%
Fully asynchronous
• All lectures are recorded, and attendance is not required.
• That said, attending lectures is a great way to participate in shaping the course and to build connections with the teaching team.
• Office hours are offered in person and on Zoom; details to come.
• Continuous evaluation: three assignments, four online quizzes, and three components to the project work.
A note on grading original systems
All the homeworks culminate in an "original system" question that becomes your bakeoff entry. Here are the basic guidelines we will adopt for grading this work:
1. We want to emphasize that this needs to be an original system. It doesn't suffice to download code from the Web, retrain, and submit, even if this leads to an outstanding bakeoff score. You can build on others' code, but you have to do something new and meaningful with it.
2. Systems that are very creative and well-motivated will be given full credit even if they do not perform well on the bakeoff data. We want to encourage creative exploration!
3. Other systems will receive less than full credit, based on the judgment of the teaching team. The specific criteria will vary based on the nature of the assignment. Point deductions will be justified in feedback.
Project work
1. The second half of the course is devoted to projects.
2. The associated lectures, notebooks, and readings are focused on methods, metrics, and best practices.
3. The assignments are all project-related; details are available at the course website.
4. Exceptional final projects from past years are available (access restricted).
5. There is lots of guidance on projects.
Crucial course links
• Website
• Code repository
• Discussion forum
• Gradescope
• Teaching team: cs224u-spr2223-staff@lists.stanford.edu
Quizzes
1. Quiz 0 is on course requirements and related details. The sole purpose of the quiz is to create a clear incentive for you to study the website and understand your rights and obligations.
2. Quizzes 1–4 create a course-related incentive for individual students to study the material beyond what is required for the more free-form and collaborative assignments.
3. All quizzes are open notes, open book, open ChatGPT, etc., but no collaboration is permitted.
For next time
1. Get set up using setup.ipynb in the course repo.
2. Make sure you're in the discussion forum. If not, follow the link given at the homepage for our course Canvas.
3. Consider doing Quiz 0 as a way of getting to know your rights and obligations for this course.
4. Check out hw_sentiment.ipynb. If this material is new to you or you need a refresher, check out the background materials.
Computing resources
1. We expect to get you AWS credits.
2. Consider getting a Colab Pro account; at $9.99/month, a three-month subscription is cheaper than even the cheapest textbooks.
3. Sign up for SageMaker Studio Lab for additional free GPU support.
4. Sign up for Cohere for free (for now) access to outstanding language models (and OpenAI still offers $5 in credits for new accounts).
Core goals (repeated from above)
• Hands-on experience with a wide range of challenging NLU problems.
• A mentor from the teaching team will guide you through the project assignments – there are many examples of these projects becoming important publications.
• Central goal: to make you the best – most insightful and responsible – NLU researcher and practitioner wherever you go next.