
Some intuitions about large language models

Jason Wei

OpenAI

1
Mistakes & opinions my own, and not of my employer.
Fundamental question. Why do large language
models work so well?

Thing I’ve been thinking about recently: Manually inspecting data gives us clear intuitions about what we do with our models.

2
Looking at data = training your biological neural net.

Your biological neural net makes many observations about the data after reading it.

These intuitions can be valuable.

(I once manually annotated an entire lung cancer image classification dataset. Several papers came out of intuitions from that process.)

3
Talk outline

(1) Next-word prediction is massively multi-task learning.
(2) Learning <input, output> relationships can be cast as next-word prediction.
(3) Words have different information densities, so give language models time to think.
(4) Scaling model size and data is expected to continue improving loss.
(5) Overall loss improves smoothly, but individual tasks can improve suddenly.
(6) Large-enough language models do actually “learn” in-context.

4
Background: Pre-training vs post-training

Data quantity: Pre-training: a lot, ~ the internet (trillions of words). Post-training: small (millions of examples?).
Data quality: Pre-training: low. Post-training: high.
Goal: Pre-training: where most “learning” occurs. Post-training: makes the model usable by the general public.

5
Review: language models

A (pre-training only) language model takes a context like “Dartmouth students like to ___” and assigns a probability to every possible next word. Hypothetical distribution: a 0.00001, aardvark 0.000004, …, drink 0.5, …, study 0.23, …, zucchini 0.000002.

Loss = - log P(next word | previous words), per word, on an unseen test set.

Example. If your loss is 3, then you have a 1/(e^3) probability of getting the next token right on average.

The best language model is the one that best predicts an unseen test set (i.e., best test loss).

6
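
A minimal sketch (not from the slides) of the loss-probability relationship described above, using the toy distribution as an example:

```python
import math

# Hypothetical next-word distribution for "Dartmouth students like to ___"
probs = {"a": 0.00001, "aardvark": 0.000004, "drink": 0.5, "study": 0.23, "zucchini": 0.000002}

def per_word_loss(p_next_word: float) -> float:
    """Loss for one word: negative log-probability assigned to the word that actually came next."""
    return -math.log(p_next_word)

print(per_word_loss(probs["drink"]))  # ~0.69 if the true next word was "drink"

# Going the other way: a test loss of 3 corresponds to an average
# probability of e^-3 of getting the next token right.
print(math.exp(-3))                   # ~0.0498
```
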
Intuition 1.
Next-word prediction on large data can be viewed as
massively multi-task learning.

7
Example tasks from next-word prediction
Task → Example sentence in pre-training that would teach that task

Grammar: In my free time, I like to {run, banana}
Lexical semantics: I went to the zoo to see giraffes, lions, and {zebras, spoon}
World knowledge: The capital of Denmark is {Copenhagen, London}
Sentiment analysis: Movie review: I was engaged and on the edge of my seat the whole time. The movie was {good, bad}
Harder sentiment analysis: Movie review: Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was {bad, good}
Translation: The word for “pretty” in Spanish is {bonita, hola}
Spatial reasoning: [...] Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the {kitchen, store}
Math question: First grade arithmetic exam: 3 + 8 + 4 = {15, 11}

[millions more]

Extreme multi-task learning!


8
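
One way to make these “tasks” concrete is to check which candidate continuation an off-the-shelf model prefers. A minimal sketch (not part of the talk) using GPT-2 via the Hugging Face transformers library:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the continuation tokens; logits at position i-1 predict token i.
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

prompt = "In my free time, I like to"
for candidate in [" run", " banana"]:
    print(candidate, continuation_logprob(prompt, candidate))
# A reasonable model prefers " run" -- grammar learned purely from next-word prediction.
```
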
There are a lot of possible “tasks”, and they can be arbitrary

Input → Target (Task)

“A transformer is a deep learning architecture, initially proposed in” → “2017” (factual recall)
“A transformer is a deep learning architecture, initially proposed in 2017” → “,” (comma prediction)
“A transformer is a deep learning architecture, initially proposed in 2017,” → “that” (grammar)
“A transformer is a deep learning architecture, initially proposed in 2017, that” → “relies” (impossible?)

https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)

Being a language model is not easy! A lot of arbitrary words to predict. Tasks are weird and not clean.
9
Language models are really good next word predictors.

Simple objective, complex environment (all the action happens in the data).

10
Intuition 2.
Next-word prediction is extremely general.

Learning <input, output> relationships can be cast as next-word prediction.

This is called in-context learning.

11
In-context learning (i.e., few-shot prompting)

Give examples with some delimiters as part of the input to the language model.

Optionally, include a task instruction.

Language models are few-shot learners. Brown et al., 2020.

12
Providing examples in context improves performance

Task: remove random symbols from a word.

Input: s.u!c/c!e.s s i/o/n
Output: succession

Note that performance increases with the number of examples in context (blue line).

Language models are few-shot learners. Brown et al., 2020.

13
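
A minimal sketch of how such a few-shot prompt might be assembled from <input, output> pairs; the delimiters, instruction wording, and the second example are my own choices, not from the paper:

```python
def build_few_shot_prompt(examples, query, instruction=None):
    """Format <input, output> pairs as a few-shot prompt with simple delimiters."""
    lines = []
    if instruction:
        lines.append(instruction)
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    # End with the unseen query so the model's continuation is the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

examples = [
    ("s.u!c/c!e.s s i/o/n", "succession"),
    ("c!o.m p/u-t e.r", "computer"),
]
print(build_few_shot_prompt(examples, "l.a!n/g u.a-g e",
                            instruction="Remove the random symbols from the word."))
```
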
In-context learning is convenient because for the past
decades, we used <input, output> pairs to do ML.

But there is no first-principles reason why we have to use <input, output> pairs to specify tasks.

When we communicate with humans, we give them instructions, explanations, and teach them interactively, in addition to providing examples.

14
Intuition 3.
Words can have very different information density.

Not all words are equal.

15
This token isn’t very hard to guess, and doesn’t provide much
new information.

16
This token takes a lot of work to generate—you have to do the
entire problem to figure it out.

17
Pretend you’re ChatGPT. As soon as you see the
prompt you have to immediately start typing… go!

Question: What is the square of ((8-2)*3+4)^3 / 8?


(A) 1,483,492
(B) 1,395,394
(C) 1,771,561

Tough right?

18
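
For reference, the step-by-step working (not shown on the slide); the point is that answering immediately leaves no room for these intermediate steps:

```python
# Working through the question one step at a time:
inner = (8 - 2) * 3 + 4    # 6*3 + 4 = 22
cubed = inner ** 3         # 22^3 = 10,648
divided = cubed / 8        # 10,648 / 8 = 1,331
squared = divided ** 2     # 1,331^2 = 1,771,561  -> answer (C)
print(inner, cubed, divided, squared)
```
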
Observation: some problems need more
compute than others.
Maybe one forward pass has enough compute to solve hard problems, in
principle. But in practice, you want to give the language model variable
compute, and in a way that is somewhat similar to the model’s training
distribution.

19
Chain-of-thought prompting

Encourage models to give reasoning before answering by including reasoning in the in-context examples.

[Diagram: standard prompting shows Question → Answer exemplars followed by the unseen input; chain-of-thought prompting shows Question → Chain of thought → Answer exemplars followed by the unseen input.]

Chain-of-thought prompting elicits reasoning in large language models. Wei et al., 2022.

20
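
A minimal sketch of the difference in prompt construction, reusing the Roger exemplar that appears on slide 34; the non-chain-of-thought variant of the exemplar is my own abbreviation:

```python
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
)

cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Let's think step by step. Roger started with 5 balls. 2 cans of 3 tennis "
    "balls each is 6 tennis balls. 5+6=11. The answer is 11.\n"
)

unseen_input = "Q: What is half of (3 + 7) plus one?\nA:"

# Standard few-shot prompt: exemplars map question -> answer directly.
standard_prompt = standard_exemplar + "\n" + unseen_input
# Chain-of-thought prompt: the only change is that the exemplar includes the reasoning.
cot_prompt = cot_exemplar + "\n" + unseen_input
print(cot_prompt)
```
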
Decompose the problem first to reason even more

Decompose the question into subquestions. Solve the subquestions one-by-one.

Least-to-most prompting enables complex reasoning in large language models. Zhou et al., 2022.

21
Problem decomposition may one day allow AI to solve
challenging problems
Prompt: Write a research proposal about the best approaches for reducing global warming.

Hypothetical response: Let me first decompose this problem into several steps and substeps.

1. First I’ll read the literature to understand the best existing scenarios…
   1a. …
   1b. …
   [...]
2. Based on the above, the biggest gaps in the current approaches are [...]
3. To solve these gaps, we first need to study and understand the following…

[...so on…]

22
Intuition 4.
Scaling language models (size * data = compute) is expected to
continue improving loss.

23
Perspectives on scaling

[Chart: training compute in FLOPs (log scale, ~10^20 to 10^25) by year, 2018-2023.]
BERT-large (2018): ~10^20 FLOPs, 16 V100 GPUs for 33 hours.
T5 11B: ~10^22 FLOPs, ~100x BERT-large.
GPT-3: ~10^23 FLOPs, ~1,000x BERT-large.
PaLM-1: ~10^24 FLOPs, 6k TPU v4 for 2 months.

Remember “moles” from high school chemistry? 6e23: “as many elementary entities as there are atoms in 0.012 kilogram of carbon 12.”


24
Scaling predictably improves performance (“scaling laws”)

Kaplan et al., 2020: “Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute for training.”

[Plot from Scaling laws for neural language models (Kaplan et al., 2020): loss goes down as compute increases, across seven orders of magnitude.]

Jason’s rephrase: You should expect to get a better language model if you scale up compute.

25
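
A minimal sketch of what “plot scaling curves” looks like in practice: fit a power law to (compute, loss) measurements in log-log space and extrapolate. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical (training compute, test loss) measurements -- made-up numbers.
compute = np.array([1e19, 1e20, 1e21, 1e22, 1e23])   # FLOPs
loss = np.array([4.2, 3.6, 3.1, 2.7, 2.35])

# A Kaplan-style power law L(C) = a * C^b (with b < 0) is a straight line in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
print(f"fit: L(C) ~ {a:.1f} * C^{b:.3f}")

# The point of a scaling law: extrapolate to more compute before spending it.
print(f"predicted loss at 1e24 FLOPs: {a * (1e24 ** b):.2f}")
```
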
Why does scaling work? Hard to confirm, but just some guesses.

Small language model:
Memorization is costly. “Parameters are scarce, so I have to decide which facts are worth memorizing.”
First-order correlations. “Wow, that token was hard. It was hard enough for me to even get it in the top-10 predictions. Just trying to predict reasonable stuff, I’m not destined for greatness.”

Large language model:
More generous with memorizing tail knowledge. “I have a lot of parameters so I’ll just memorize all the facts, no worries.”
Complex heuristics. “Wow, I got that one wrong. Maybe there’s something complicated going on here, let me try to figure it out. I want to be the GOAT.”
26
Intuition 5.
While overall loss scales smoothly, individual downstream tasks
may scale in an emergent fashion.

27
Take a closer look at loss. Consider:

Overall loss = 1e-10 * loss_grammar +
               1e-10 * loss_sentiment_analysis +
               1e-10 * loss_world_knowledge +
               … +
               1e-10 * loss_math_ability +
               1e-10 * loss_spatial_reasoning

If loss goes from 4 to 3, do all tasks get better uniformly? Probably not.

28
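
A toy illustration (made-up numbers, not from the talk) of how an aggregate loss can keep falling fairly smoothly even while one of its components improves suddenly:

```python
# Overall loss as a weighted mix of per-"task" losses. Made-up numbers.
checkpoints = ["1e20 FLOPs", "1e21 FLOPs", "1e22 FLOPs", "1e23 FLOPs"]
loss_grammar = [1.0, 0.6, 0.4, 0.3]   # improves early and smoothly
loss_math    = [2.0, 2.0, 1.9, 0.5]   # flat for a long time, then improves suddenly

for ckpt, g, m in zip(checkpoints, loss_grammar, loss_math):
    overall = 0.9 * g + 0.1 * m       # grammar-like tokens dominate the mixture
    print(ckpt, round(overall, 2))
# Overall loss: 1.1, 0.74, 0.55, 0.32 -- the sudden jump in "math" is barely visible.
```
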
202 downstream tasks in BIG-Bench:
Smoothly increasing: 29%
Emergent abilities: 33%
Flat: 22%
Not correlated with scale: 13%
Inverse scaling (performance decreases with scale): 2.5%

29
Emergence in science

General definition in science: emergence is a qualitative change that arises from quantitative changes. Popularized by a 1972 piece by Nobel Prize-winning physicist P.W. Anderson.

With a bit of uranium, nothing special happens. With a large amount of uranium, you get a nuclear reaction.

Given only small molecules such as calcium, you can’t meaningfully encode useful information. Given larger molecules such as DNA, you can encode a genome.

Future ML systems will be qualitatively different (2022).


30
Emergence in large language models

Large LM definition: an ability is emergent if it is not present in smaller models, but is present in larger models.
Emergent abilities of large language models. Wei et al., 2022.

Scale axes: model size (# parameters), training data (# tokens), training compute (FLOPs); model size x training tokens = training compute. Why is this not everything? Data quality, optimization objective, model architecture, and post-training techniques.

31
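
A common rule of thumb (not stated on the slide) is that training compute is roughly 6 * parameters * tokens. A quick sanity check against the chart on slide 24, assuming GPT-3’s ~175B parameters and ~300B training tokens:

```python
# Rule-of-thumb estimate of training compute (assumption, not from the slide).
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# GPT-3: ~175B parameters trained on ~300B tokens.
print(f"{training_flops(175e9, 300e9):.2e}")  # ~3.15e+23 FLOPs, consistent with ~10^23 on slide 24
```
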
Emergence in few-shot prompting: examples

[Plots: y-axis is performance on the task; x-axis is “scale”.]

Performance is flat for small models. Performance spikes to well above random for large models.

Open research question: is it possible to predict emergence using only smaller model sizes?

32
Emergence in prompting: example

Prompt:
Input (English): I like to play soccer and tennis
Target (Spanish):

[Plot: BLEU score vs model scale. ada and babbage output “I like to play soccer and tennis”, simply repeating the input; curie outputs “Me gusta jugar al fútbol y al tenis”.]

Model “curie” suddenly figures out to translate and not repeat.

33
Benefit from chain-of-thought is emergent
Prompt
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step. Roger started with 5 balls. 2 cans of 3 tennis
balls each is 6 tennis balls. 5+6=11. The answer is 11.

Q: What is half of (3 + 7) plus one?
A:

text-ada-001 (small LM): CoT hurts performance. Output: “The answer is the result of adding 1 more ball (3 + 7) plus 1.”

text-davinci-002 (large LM): CoT helps performance. Output: “Let's think step by step. 3+7=10. 10/2=5. 5+1=6. The answer is 6.”

Challenging BIG-Bench tasks and whether chain-of-thought can solve them. Suzgun et al., 2022.

34
Three implications of emergence

Unpredictable. Emergence cannot be predicted solely by extrapolating scaling curves from smaller models. (Scaling laws don’t apply for downstream tasks!)

Unintentional. Emergent abilities are not explicitly specified by the trainer of the language model (next-word prediction “only”). (In the history of deep learning, has this been true before?)

One model, many tasks. Since scaling has unlocked emergent abilities, further scaling can be expected to elicit more abilities. (Let’s scale more, right? Any undesirable emergent abilities?)

Suggested further reading: Emergent deception and emergent optimization.
35
Intuition 6.
Why would providing examples in-context improve performance?

The “shallow” way is that it tells models about formatting or the output space.

The more “profound” way is that models actually learn in-context the relationship between inputs and outputs.

Large-enough models have signs of life for the profound way.

36
The shuffled label setting

Randomly shuffle the labels so that there is no correlation between inputs and their corresponding labels.

Input: Great movie Output: positive
Input: Really boring Output: negative
Input: Amazing performance Output: positive
Input: Contains no wit Output: negative
[...]
Input: Loved it Output:

37
For GPT-3, shuffled labels barely hurt performance

In the shuffled labels setting, in-context learning performance barely drops. This means the model doesn’t look at <input, output> mappings.

Not so fast…

Rethinking the role of demonstrations: what makes in-context learning work? Min et al., 2021.

38
The flipped label setting

Flip all labels in the examples. If the model looks at <input, output> relationships, then it should guess the exact opposite of the real answer. So performance is ideally 0. (Random guessing should be 50% for binary classification.)

Input: Great movie Output: negative (flipped from positive)
Input: Really boring Output: positive (flipped from negative)
Input: Amazing performance Output: negative (flipped from positive)
Input: Contains no wit Output: positive (flipped from negative)
[...]
Input: Loved it Output:

39
Flipped labels affect performance only for large models

In the “flipped-label” setting, small models see no performance drop.

On the other hand, large models learn to follow the flipped labels (accuracy reverses).

Larger language models do in-context learning differently. Wei et al., 2023.

40
The semantically unrelated labels setting

Use semantically unrelated labels. Now the model must look at the <input, output> relationships to figure out what the task is.

Input: Great movie Output: foo
Input: Really boring Output: bar
Input: Amazing performance Output: foo
Input: Contains no wit Output: bar
[...]
Input: Loved it Output:

41
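
A minimal sketch of how the shuffled, flipped, and semantically unrelated label settings above could be constructed from the same in-context examples; the prompt formatting is my own:

```python
import random

examples = [
    ("Great movie", "positive"),
    ("Really boring", "negative"),
    ("Amazing performance", "positive"),
    ("Contains no wit", "negative"),
]

def flip(label):
    return "negative" if label == "positive" else "positive"

def unrelated(label):
    return "foo" if label == "positive" else "bar"

def make_prompt(pairs, query):
    body = "\n".join(f"Input: {x} Output: {y}" for x, y in pairs)
    return f"{body}\nInput: {query} Output:"

query = "Loved it"
print(make_prompt(examples, query))                                  # original labels
print(make_prompt([(x, flip(y)) for x, y in examples], query))       # flipped labels
print(make_prompt([(x, unrelated(y)) for x, y in examples], query))  # semantically unrelated labels
# Shuffled labels: reassign labels at random so they carry no signal about the input.
shuffled = [(x, random.choice(["positive", "negative"])) for x, _ in examples]
print(make_prompt(shuffled, query))
```
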
Large models don’t need semantic labels to figure out the task

Small models have a large performance drop. They need the semantic labels.

Large models have a minimal performance drop. They can figure out the task even without semantic labels.

Larger language models do in-context learning differently. Wei et al., 2023.

42
Large LM intuition → General idea

(3) Words have different information densities, so give language models time to think. → Most data has varying information densities, so adaptive compute can help.

(4) Scaling model size and data is expected to continue improving loss. → Plot scaling curves to see if doing more of something will be a good strategy.

(5) Overall loss improves smoothly, but individual tasks can improve suddenly. → To better understand aggregate metrics, decompose them into individual categories.

43
Thanks.
X / Twitter: @_jasonwei

I’d love your feedback on this talk: https://tinyurl.com/jasonwei

44
