Jason Wei Stanford CS330 Talk
Jason Wei
OpenAI
Mistakes & opinions are my own, and not of my employer.
Fundamental question: Why do large language models work so well?
Looking at data = training your biological neural net.
Talk outline
Background: Pre-training vs post-training
Review: language models
(Hypothetical example; pre-training only.)

Input: "Dartmouth students like to ___" → Language Model → probability of each next word:

Word | Probability
a | 0.00001
aardvark | 0.000004
…
drink | 0.5
…
study | 0.23
…
zucchini | 0.000002

Loss = -log P(next word | previous words), measured per word on an unseen test set.

Example: if your loss is 3, then you have a 1/(e^3) probability of getting the next token right on average.
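As a rough sketch of that arithmetic (my own illustration, not code from the talk): the per-word loss is just the negative log of the probability the model assigned to the actual next word, so a loss of 3 nats corresponds to about e^-3 ≈ 5% probability on the correct token.

```python
import math

# Hypothetical next-word distribution for "Dartmouth students like to ___".
next_word_probs = {"a": 0.00001, "aardvark": 0.000004, "drink": 0.5,
                   "study": 0.23, "zucchini": 0.000002}

true_next_word = "drink"
loss = -math.log(next_word_probs[true_next_word])  # per-word cross-entropy, in nats
print(f"loss = {loss:.3f}")                         # ≈ 0.693

# The relationship runs in reverse, too: a test loss of 3 nats per word means
# the model assigns the correct next word probability e^-3 on average.
print(math.exp(-3))  # ≈ 0.0498
```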
Intuition 1.
Next-word prediction on large data can be viewed as massively multi-task learning.
Example tasks from next-word prediction
Task | Example sentence in pre-training that would teach that task
Grammar | In my free time, I like to {run, banana}
Lexical semantics | I went to the zoo to see giraffes, lions, and {zebras, spoon}
World knowledge | The capital of Denmark is {Copenhagen, London}
Sentiment analysis | Movie review: I was engaged and on the edge of my seat the whole time. The movie was {good, bad}
Harder sentiment analysis | Movie review: Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was {bad, good}
Translation | The word for “pretty” in Spanish is {bonita, hola}
Spatial reasoning | [...] Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the {kitchen, store}
Math question | First grade arithmetic exam: 3 + 8 + 4 = {15, 11}
[millions more]
Being a language model is not easy! There are a lot of arbitrary words to predict, and the tasks are weird and not clean.
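To make the multi-task framing concrete, here is a minimal sketch (my own illustration, not from the talk) using the public GPT-2 checkpoint via Hugging Face transformers; `sequence_logprob` is my own helper. Every row of the table reduces to comparing next-word probabilities under the same objective.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    """Total log P(token | previous tokens) the model assigns to a string."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logprobs = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logprobs.sum().item()

prompt = "The capital of Denmark is"
for candidate in [" Copenhagen", " London"]:
    print(candidate, sequence_logprob(prompt + candidate))
# A model that picked up world knowledge from next-word prediction should give
# " Copenhagen" the higher log-probability; the same comparison works for the
# grammar, sentiment, and translation rows above.
```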
Language models are really good next-word predictors.
Simple objective, complex environment (all the action happens in the data).
Intuition 2.
Next-word prediction is extremely general.
In-context learning (i.e., few-shot prompting)
Providing examples in context improves performance
[Figure: a prompt with in-context examples, which the model completes with "succession".]
In-context learning is convenient because, for the past few decades, we have used <input, output> pairs to do ML.
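A minimal sketch of that framing (my own, not from the talk): the <input, output> pairs that would have been a supervised training set are simply formatted into the prompt, with the unseen input appended at the end for the model to complete.

```python
# Few-shot prompt construction from <input, output> pairs (illustrative examples).
examples = [
    ("I loved this movie!", "positive"),
    ("The plot was a complete mess.", "negative"),
    ("A stunning, heartfelt performance.", "positive"),
]

def build_few_shot_prompt(pairs, query):
    lines = [f"Input: {x}\nOutput: {y}" for x, y in pairs]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

print(build_few_shot_prompt(examples, "Two hours of my life I won't get back."))
# Any language model can complete this prompt; with the in-context examples it
# should output "negative" without any gradient updates.
```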
Intuition 3.
Words can have very different information density.
This token isn't very hard to guess, and doesn't provide much new information.
This token takes a lot of work to generate: you have to do the entire problem to figure it out.
Pretend you're ChatGPT. As soon as you see the prompt, you have to immediately start typing… go!

Tough, right?
Observation: some problems need more compute than others.

Maybe one forward pass has enough compute to solve hard problems, in principle. But in practice, you want to give the language model variable compute, and in a way that is somewhat similar to the model's training distribution.
Chain-of-thought prompting

Encourage models to give reasoning before answering by including reasoning in the in-context examples.

[Diagram: standard prompting formats each in-context example as Question → Answer before the unseen input; chain-of-thought prompting formats each example as Question → Chain of thought → Answer before the unseen input.]
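To make the diagram concrete, here is a minimal sketch (my own, reusing the tennis-ball example from a later slide; the cafeteria question is a hypothetical unseen input): the only change from standard prompting is that each in-context example includes its reasoning before the answer.

```python
example = {
    "question": ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
                 "Each can has 3 tennis balls. How many tennis balls does he have now?"),
    "chain_of_thought": ("Roger started with 5 balls. 2 cans of 3 tennis balls "
                         "each is 6 tennis balls. 5 + 6 = 11."),
    "answer": "11",
}
# Hypothetical unseen input, not from the talk.
unseen = "The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples now?"

standard_prompt = (
    f"Q: {example['question']}\nA: The answer is {example['answer']}.\n\n"
    f"Q: {unseen}\nA:"
)

chain_of_thought_prompt = (
    f"Q: {example['question']}\n"
    f"A: {example['chain_of_thought']} The answer is {example['answer']}.\n\n"
    f"Q: {unseen}\nA:"
)
print(chain_of_thought_prompt)
```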
Decompose the problem first to reason even more
[Diagram: decompose the question into subquestions.]
Least-to-most prompting enables complex reasoning in large language models. Zhou et al., 2022.
Problem decomposition may one day allow AI to solve challenging problems

Prompt: Write a research proposal about the best approaches for reducing global warming.
Hypothetical response: Let me first decompose this problem into several steps and substeps. [...so on…]
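A minimal sketch of decomposition-style prompting (my own, not the procedure from the Zhou et al. paper; `generate` is a stand-in for any language model API): first ask the model for subquestions, then answer them one at a time, carrying earlier answers in the context.

```python
def generate(prompt: str) -> str:
    return "..."  # stand-in; replace with a real language model call

question = ("Write a research proposal about the best approaches "
            "for reducing global warming.")

# Stage 1: ask the model to decompose the problem into subquestions.
decomposition = generate(
    f"Problem: {question}\nList the subquestions you would need to answer first:"
)

# Stage 2: answer each subquestion in order, keeping earlier answers in context.
context = f"Problem: {question}\n"
for sub in decomposition.splitlines():
    sub = sub.strip()
    if not sub:
        continue
    step_prompt = context + "\nSubquestion: " + sub + "\nAnswer:"
    context = step_prompt + " " + generate(step_prompt) + "\n"

# Final stage: answer the original problem given all the intermediate answers.
final_answer = generate(context + "\nNow write the full proposal:")
```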
Intuition 4.
Scaling language models (size * data = compute) is expected to continue improving loss.
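A minimal sketch of how you might check this in practice (my own; the compute and loss numbers below are made up purely to show the mechanics, not real measurements): scaling laws are power laws, loss ≈ a * compute^(-b), which are straight lines in log-log space, so a handful of points can be fit and extrapolated with a simple regression.

```python
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])  # training FLOPs (illustrative)
loss = np.array([3.9, 3.4, 3.0, 2.65, 2.35])        # test loss in nats (illustrative)

# log(loss) = log(a) + slope * log(compute), with slope = -b.
slope, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
extrapolated = np.exp(log_a) * 1e25 ** slope
print(f"fitted exponent b = {-slope:.3f}, "
      f"extrapolated loss at 1e25 FLOPs ≈ {extrapolated:.2f}")
```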
Perspectives on scaling
[Chart: training compute of notable models on a log scale, from roughly 10^20 FLOPs (BERT-large, 16 V100 GPUs for 33 hours) up to 10^25 FLOPs.]
Why does scaling work? Hard to confirm, but here are some guesses.
Intuition 5.
While overall loss scales smoothly, individual downstream tasks may scale in an emergent fashion.
Take a closer look at loss. Consider:
202 downstream tasks in BIG-Bench
Emergent abilities: 33%
Smoothly increasing: 29%
Not correlated with scale: 13%
Emergence in science

General definition in science: popularized by a 1972 piece ("More Is Different") by Nobel-Prize-winning physicist P.W. Anderson.
Example: With a bit of uranium, nothing special happens. With a large amount of uranium, you get a nuclear reaction.
Example: Given only small molecules such as calcium, you can't meaningfully encode useful information. Given larger molecules such as DNA, you can encode a genome.

Large LM definition: An ability is emergent if it is not present in smaller models, but is present in larger models.

Emergent abilities of large language models. Wei et al., 2022.
Axes of scale: model size (# parameters), training data (# tokens), training compute (FLOPs).
Why is this not everything? Data quality, optimization objective, model architecture, post-training techniques.
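One crude way to operationalize the definition above (my own sketch, not the methodology of the paper): call a task emergent if performance sits at the random-guessing baseline for smaller scales and rises clearly above it only at the largest scales.

```python
def looks_emergent(scales, accuracies, random_baseline, margin=0.05):
    """scales and accuracies are parallel lists, ordered by increasing scale."""
    above = [acc > random_baseline + margin for acc in accuracies]
    # Near-random at the smaller scales, clearly above random at the largest ones.
    return (not any(above[: len(above) // 2])) and all(above[-2:])

# Illustrative numbers only: accuracy stays at chance (25%) until the largest scales.
print(looks_emergent(
    scales=[1e20, 1e21, 1e22, 1e23, 1e24],
    accuracies=[0.25, 0.26, 0.24, 0.45, 0.70],
    random_baseline=0.25,
))  # True
```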
Emergence in few-shot prompting: examples
[Plots: y-axis is performance on the task; x-axis is "scale".]
Emergence in prompting: example
[Figure: an example prompt with completions from models of increasing scale (ada, babbage, …).]
Benefit from chain-of-thought is emergent
Prompt:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step. Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Three implications of emergence
1. Scaling laws don't apply for downstream tasks!
2. In the history of deep learning, has this been true before?
3. Let's scale more, right? (Any undesirable emergent abilities?)
The shuffled label setting
[...]
For GPT-3, shuffling the labels barely hurts performance

In the shuffled labels setting, in-context learning performance barely drops.

Not so fast…
The flipped label setting
Flip all labels in the in-context examples:

Input: Great movie Output: positive → negative
Input: Really boring Output: negative → positive
Input: Amazing performance Output: positive → negative
Input: Contains no wit Output: negative → positive
[...]
Input: Loved it Output:

If the model looks at <input, output> relationships, then it should guess the exact opposite of the real answer. So performance is ideally 0. (Random guessing should be 50% for binary classification.)
Flipped labels affect performance only for large models
The semantically unrelated labels setting
Use semantically unrelated labels:

Input: Great movie Output: positive → foo
Input: Really boring Output: negative → bar
Input: Amazing performance Output: positive → foo
[...]

Now the model must look at the <input, output> relationships to figure out what the task is.
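All three probes (shuffled, flipped, and semantically unrelated labels) amount to remapping the labels shown in the in-context examples. A minimal sketch (my own, not the paper's code):

```python
import random

examples = [
    ("Great movie", "positive"),
    ("Really boring", "negative"),
    ("Amazing performance", "positive"),
    ("Contains no wit", "negative"),
]

def remap(pairs, setting):
    if setting == "flipped":        # swap the meaning of the labels
        mapping = {"positive": "negative", "negative": "positive"}
    elif setting == "unrelated":    # semantically unrelated labels
        mapping = {"positive": "foo", "negative": "bar"}
    elif setting == "shuffled":     # reassign the labels at random
        labels = [y for _, y in pairs]
        random.shuffle(labels)
        return list(zip([x for x, _ in pairs], labels))
    return [(x, mapping[y]) for x, y in pairs]

prompt = "\n".join(f"Input: {x} Output: {y}" for x, y in remap(examples, "flipped"))
prompt += "\nInput: Loved it Output:"
# A model that truly reads the <input, output> pairs should now complete with
# "negative"; a model relying only on prior label semantics will still say "positive".
```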
Large models don't need semantic labels to figure out the task

Small models have a large performance drop. They need the semantic labels.
Large LM intuition | General idea
(4) Scaling model size and data is expected to continue improving loss. | Plot scaling curves to see if doing more of something will be a good strategy.
Thanks.
X / Twitter: @_jasonwei