
Some intuitions about large language models

Jason Wei

OpenAI

1
Mistakes & opinions my own, and not of my employer.
Fundamental question. Why do large language
models work so well?

Thing I’ve been thinking about recently: Manually inspecting data gives us clear intuitions about what we do with our models.

2
Looking at data = training your biological neural net.

Your biological neural net makes many observations about the data after reading it.

These intuitions can be valuable.

(I once manually annotated an entire lung cancer image classification dataset. Several papers came out of intuitions from that process.)

3
Talk outline

(1) Next-word prediction is massively multi-task learning.
(2) Learning <input, output> relationships can be cast as next-word prediction.
(3) Words have different information densities, so give language models time to think.
(4) Scaling model size and data is expected to continue improving loss.
(5) Overall loss improves smoothly, but individual tasks can improve suddenly.
(6) Large-enough language models do actually “learn” in-context.

4
Background: Pre-training vs post-training

Data quantity: Pre-training: a lot, ~ the internet (trillions of words). Post-training: small (millions of examples?).
Data quality: Pre-training: low. Post-training: high.
Goal: Pre-training: where most “learning” occurs. Post-training: makes the model usable by the general public.

5
Review: language models

A (pre-training only) language model takes a context like “Dartmouth students like to ___” and assigns a probability to every possible next word. Hypothetical distribution: a 0.00001, aardvark 0.000004, …, drink 0.5, …, study 0.23, …, zucchini 0.000002.

Loss = - log P(next word | previous words), per word, on an unseen test set.

Example. If your loss is 3, then you have a 1/(e^3) probability of getting the next token right on average.

The best language model is the one that best predicts an unseen test set (i.e., best test loss).

6
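
A minimal sketch (not from the slides) of the loss-probability relationship described above, using the toy distribution as an example:

```python
import math

# Hypothetical next-word distribution for "Dartmouth students like to ___"
probs = {"a": 0.00001, "aardvark": 0.000004, "drink": 0.5, "study": 0.23, "zucchini": 0.000002}

def per_word_loss(p_next_word: float) -> float:
    """Loss for one word: negative log-probability assigned to the word that actually came next."""
    return -math.log(p_next_word)

print(per_word_loss(probs["drink"]))  # ~0.69 if the true next word was "drink"

# Going the other way: a test loss of 3 corresponds to an average
# probability of e^-3 of getting the next token right.
print(math.exp(-3))                   # ~0.0498
```
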
Intuition 1.
Next-word prediction on large data can be viewed as
massively multi-task learning.

7
Example tasks from next-word prediction
Task → Example sentence in pre-training that would teach that task

Grammar: In my free time, I like to {run, banana}
Lexical semantics: I went to the zoo to see giraffes, lions, and {zebras, spoon}
World knowledge: The capital of Denmark is {Copenhagen, London}
Sentiment analysis: Movie review: I was engaged and on the edge of my seat the whole time. The movie was {good, bad}
Harder sentiment analysis: Movie review: Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was {bad, good}
Translation: The word for “pretty” in Spanish is {bonita, hola}
Spatial reasoning: [...] Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the {kitchen, store}
Math question: First grade arithmetic exam: 3 + 8 + 4 = {15, 11}

[millions more]

Extreme multi-task learning!


8
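
One way to make these “tasks” concrete is to check which candidate continuation an off-the-shelf model prefers. A minimal sketch (not part of the talk) using GPT-2 via the Hugging Face transformers library:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the continuation tokens; logits at position i-1 predict token i.
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

prompt = "In my free time, I like to"
for candidate in [" run", " banana"]:
    print(candidate, continuation_logprob(prompt, candidate))
# A reasonable model prefers " run" -- grammar learned purely from next-word prediction.
```
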
There are a lot of possible “tasks”, and they can be arbitrary

Input → Target (Task)

“A transformer is a deep learning architecture, initially proposed in” → “2017” (factual recall)
“A transformer is a deep learning architecture, initially proposed in 2017” → “,” (comma prediction)
“A transformer is a deep learning architecture, initially proposed in 2017,” → “that” (grammar)
“A transformer is a deep learning architecture, initially proposed in 2017, that” → “relies” (impossible?)

https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)

Being a language model is not easy! A lot of arbitrary words to predict. Tasks are weird and not clean.
9
Language models are really good next word predictors.

Simple objective, complex environment (all the action happens in the data).

10
Intuition 2.
Next-word prediction is extremely general.

Learning <input, output> relationships can be cast as next-word prediction.

This is called in-context learning.

11
In-context learning (i.e., few-shot prompting)

Give examples with some delimiters as part of the input to the language model.

Optionally, include a task instruction.

Language models are few-shot learners. Brown et al., 2020.

12
Providing examples in context improves performance

Task: remove random symbols from a word.

Input: s.u!c/c!e.s s i/o/n
Output: succession

Note that performance increases with the number of examples in context (blue line).

Language models are few-shot learners. Brown et al., 2020.

13
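
A minimal sketch of how such a few-shot prompt might be assembled from <input, output> pairs; the delimiters, instruction wording, and the second example are my own choices, not from the paper:

```python
def build_few_shot_prompt(examples, query, instruction=None):
    """Format <input, output> pairs as a few-shot prompt with simple delimiters."""
    lines = []
    if instruction:
        lines.append(instruction)
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    # End with the unseen query so the model's continuation is the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

examples = [
    ("s.u!c/c!e.s s i/o/n", "succession"),
    ("c!o.m p/u-t e.r", "computer"),
]
print(build_few_shot_prompt(examples, "l.a!n/g u.a-g e",
                            instruction="Remove the random symbols from the word."))
```
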
In-context learning is convenient because for the past
decades, we used <input, output> pairs to do ML.

But there is no first-principles reason why we have to use <input, output> pairs to specify tasks.

When we communicate with humans, we give them instructions, explanations, and teach them interactively, in addition to providing examples.

14
Intuition 3.
Words can have very different information density.

Not all words are equal.

15
This token isn’t very hard to guess, and doesn’t provide much
new information.

16
This token takes a lot of work to generate—you have to do the
entire problem to figure it out.

17
Pretend you’re ChatGPT. As soon as you see the
prompt you have to immediately start typing… go!

Question: What is the square of ((8-2)*3+4)^3 / 8?


(A) 1,483,492
(B) 1,395,394
(C) 1,771,561

Tough right?

18
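
For reference, the step-by-step working (not shown on the slide); the point is that answering immediately leaves no room for these intermediate steps:

```python
# Working through the question one step at a time:
inner = (8 - 2) * 3 + 4    # 6*3 + 4 = 22
cubed = inner ** 3         # 22^3 = 10,648
divided = cubed / 8        # 10,648 / 8 = 1,331
squared = divided ** 2     # 1,331^2 = 1,771,561  -> answer (C)
print(inner, cubed, divided, squared)
```
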
Observation: some problems need more
compute than others.
Maybe one forward pass has enough compute to solve hard problems, in
principle. But in practice, you want to give the language model variable
compute, and in a way that is somewhat similar to the model’s training
distribution.

19
Chain-of-thought prompting

Encourage models to give reasoning before answering by including reasoning in the in-context examples.

[Diagram: standard prompting shows Question → Answer exemplars followed by the unseen input; chain-of-thought prompting shows Question → Chain of thought → Answer exemplars followed by the unseen input.]

Chain-of-thought prompting elicits reasoning in large language models. Wei et al., 2022.

20
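
A minimal sketch of the difference in prompt construction, reusing the Roger exemplar that appears on slide 34; the non-chain-of-thought variant of the exemplar is my own abbreviation:

```python
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
)

cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Let's think step by step. Roger started with 5 balls. 2 cans of 3 tennis "
    "balls each is 6 tennis balls. 5+6=11. The answer is 11.\n"
)

unseen_input = "Q: What is half of (3 + 7) plus one?\nA:"

# Standard few-shot prompt: exemplars map question -> answer directly.
standard_prompt = standard_exemplar + "\n" + unseen_input
# Chain-of-thought prompt: the only change is that the exemplar includes the reasoning.
cot_prompt = cot_exemplar + "\n" + unseen_input
print(cot_prompt)
```
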
Decompose the problem first to reason even more

Decompose the question into subquestions. Solve the subquestions one-by-one.

Least-to-most prompting enables complex reasoning in large language models. Zhou et al., 2022.

21
Problem decomposition may one day allow AI to solve
challenging problems
Prompt: Write a research proposal about the best approaches for reducing global warming.

Hypothetical response: Let me first decompose this problem into several steps and substeps.

1. First I’ll read the literature to understand the best existing scenarios…
   1a. …
   1b. …
   [...]
2. Based on the above, the biggest gaps in the current approaches are [...]
3. To solve these gaps, we first need to study and understand the following…

[...so on…]

22
Intuition 4.
Scaling language models (size * data = compute) is expected to
continue improving loss.

23
Perspectives on scaling

[Chart: training compute in FLOPs (log scale, ~10^20 to 10^25) by year, 2018-2023.]
BERT-large (2018): ~10^20 FLOPs, 16 V100 GPUs for 33 hours.
T5 11B: ~10^22 FLOPs, ~100x BERT-large.
GPT-3: ~10^23 FLOPs, ~1,000x BERT-large.
PaLM-1: ~10^24 FLOPs, 6k TPU v4 for 2 months.

Remember “moles” from high school chemistry? 6e23: “as many elementary entities as there are atoms in 0.012 kilogram of carbon 12.”


24
Scaling predictably improves performance (“scaling laws”)

Kaplan et al., 2020: “Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute for training.”

[Plot from Scaling laws for neural language models (Kaplan et al., 2020): loss goes down as compute increases, across seven orders of magnitude.]

Jason’s rephrase: You should expect to get a better language model if you scale up compute.

25
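
A minimal sketch of what “plot scaling curves” looks like in practice: fit a power law to (compute, loss) measurements in log-log space and extrapolate. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical (training compute, test loss) measurements -- made-up numbers.
compute = np.array([1e19, 1e20, 1e21, 1e22, 1e23])   # FLOPs
loss = np.array([4.2, 3.6, 3.1, 2.7, 2.35])

# A Kaplan-style power law L(C) = a * C^b (with b < 0) is a straight line in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
print(f"fit: L(C) ~ {a:.1f} * C^{b:.3f}")

# The point of a scaling law: extrapolate to more compute before spending it.
print(f"predicted loss at 1e24 FLOPs: {a * (1e24 ** b):.2f}")
```
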
Why does scaling work? Hard to confirm, but just some guesses.

Small language model:
Memorization is costly. “Parameters are scarce, so I have to decide which facts are worth memorizing.”
First-order correlations. “Wow, that token was hard. It was hard enough for me to even get it in the top-10 predictions. Just trying to predict reasonable stuff, I’m not destined for greatness.”

Large language model:
More generous with memorizing tail knowledge. “I have a lot of parameters so I’ll just memorize all the facts, no worries.”
Complex heuristics. “Wow, I got that one wrong. Maybe there’s something complicated going on here, let me try to figure it out. I want to be the GOAT.”
26
Intuition 5.
While overall loss scales smoothly, individual downstream tasks
may scale in an emergent fashion.

27
Take a closer look at loss. Consider:

Overall loss = 1e-10 * loss_grammar +
               1e-10 * loss_sentiment_analysis +
               1e-10 * loss_world_knowledge +
               … +
               1e-10 * loss_math_ability +
               1e-10 * loss_spatial_reasoning

If loss goes from 4 to 3, do all tasks get better uniformly? Probably not.

28
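
A toy illustration (made-up numbers, not from the talk) of how an aggregate loss can keep falling fairly smoothly even while one of its components improves suddenly:

```python
# Overall loss as a weighted mix of per-"task" losses. Made-up numbers.
checkpoints = ["1e20 FLOPs", "1e21 FLOPs", "1e22 FLOPs", "1e23 FLOPs"]
loss_grammar = [1.0, 0.6, 0.4, 0.3]   # improves early and smoothly
loss_math    = [2.0, 2.0, 1.9, 0.5]   # flat for a long time, then improves suddenly

for ckpt, g, m in zip(checkpoints, loss_grammar, loss_math):
    overall = 0.9 * g + 0.1 * m       # grammar-like tokens dominate the mixture
    print(ckpt, round(overall, 2))
# Overall loss: 1.1, 0.74, 0.55, 0.32 -- the sudden jump in "math" is barely visible.
```
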
202 downstream tasks in BIG-Bench:
Smoothly increasing: 29%
Emergent abilities: 33%
Flat: 22%
Not correlated with scale: 13%
Inverse scaling (performance decreases with scale): 2.5%

29
Emergence in science

General definition in science: emergence is a qualitative change that arises from quantitative changes. Popularized by a 1972 piece by Nobel Prize-winning physicist P.W. Anderson.

With a bit of uranium, nothing special happens. With a large amount of uranium, you get a nuclear reaction.

Given only small molecules such as calcium, you can’t meaningfully encode useful information. Given larger molecules such as DNA, you can encode a genome.

Future ML systems will be qualitatively different (2022).


30
Emergence in large language models

Large LM definition: an ability is emergent if it is not present in smaller models, but is present in larger models.
Emergent abilities of large language models. Wei et al., 2022.

Scale axes: model size (# parameters), training data (# tokens), training compute (FLOPs); model size x training tokens = training compute. Why is this not everything? Data quality, optimization objective, model architecture, and post-training techniques.

31
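
A common rule of thumb (not stated on the slide) is that training compute is roughly 6 * parameters * tokens. A quick sanity check against the chart on slide 24, assuming GPT-3’s ~175B parameters and ~300B training tokens:

```python
# Rule-of-thumb estimate of training compute (assumption, not from the slide).
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# GPT-3: ~175B parameters trained on ~300B tokens.
print(f"{training_flops(175e9, 300e9):.2e}")  # ~3.15e+23 FLOPs, consistent with ~10^23 on slide 24
```
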
Emergence in few-shot prompting: examples

[Plots: y-axis is performance on the task; x-axis is “scale”.]

Performance is flat for small models. Performance spikes to well above random for large models.

Open research question: is it possible to predict emergence using only smaller model sizes?

32
Emergence in prompting: example

Prompt:
Input (English): I like to play soccer and tennis
Target (Spanish):

[Plot: BLEU score vs model scale. ada and babbage output “I like to play soccer and tennis”, simply repeating the input; curie outputs “Me gusta jugar al fútbol y al tenis”.]

Model “curie” suddenly figures out to translate and not repeat.

33
Benefit from chain-of-thought is emergent
Prompt
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step. Roger started with 5 balls. 2 cans of 3 tennis
balls each is 6 tennis balls. 5+6=11. The answer is 11.

Q: What is half of (3 + 7) plus one?
A:

text-ada-001 (small LM): CoT hurts performance. Output: “The answer is the result of adding 1 more ball (3 + 7) plus 1.”

text-davinci-002 (large LM): CoT helps performance. Output: “Let's think step by step. 3+7=10. 10/2=5. 5+1=6. The answer is 6.”

Challenging BIG-Bench tasks and whether chain-of-thought can solve them. Suzgun et al., 2022.

34
Three implications of emergence

Unpredictable. Emergence cannot be predicted solely by extrapolating scaling curves from smaller models. (Scaling laws don’t apply for downstream tasks!)

Unintentional. Emergent abilities are not explicitly specified by the trainer of the language model (next-word prediction “only”). (In the history of deep learning, has this been true before?)

One model, many tasks. Since scaling has unlocked emergent abilities, further scaling can be expected to elicit more abilities. (Let’s scale more, right? Any undesirable emergent abilities?)

Suggested further reading: Emergent deception and emergent optimization.
35
Intuition 6.
Why would providing examples in-context improve performance?

The “shallow” way is that it tells models about formatting or the output space.

The more “profound” way is that models actually learn in-context the relationship between inputs and outputs.

Large-enough models have signs of life for the profound way.

36
The shuffled label setting

Randomly shuffle the labels so that there is no correlation between inputs and their corresponding labels.

Input: Great movie Output: positive
Input: Really boring Output: negative
Input: Amazing performance Output: positive
Input: Contains no wit Output: negative
[...]
Input: Loved it Output:

37
For GPT-3, shuffled labels barely hurt performance

In the shuffled labels setting, in-context learning performance barely drops. This means the model doesn’t look at <input, output> mappings.

Not so fast…

Rethinking the role of demonstrations: what makes in-context learning work? Min et al., 2021.

38
The flipped label setting

Flip all labels in the examples. If the model looks at <input, output> relationships, then it should guess the exact opposite of the real answer. So performance is ideally 0. (Random guessing should be 50% for binary classification.)

Input: Great movie Output: negative (flipped from positive)
Input: Really boring Output: positive (flipped from negative)
Input: Amazing performance Output: negative (flipped from positive)
Input: Contains no wit Output: positive (flipped from negative)
[...]
Input: Loved it Output:

39
Flipped labels affect performance only for large models

In the “flipped-label” setting, small models see no performance drop.

On the other hand, large models learn to follow the flipped labels (accuracy reverses).

Larger language models do in-context learning differently. Wei et al., 2023.

40
The semantically unrelated labels setting

Use semantically unrelated labels. Now the model must look at the <input, output> relationships to figure out what the task is.

Input: Great movie Output: foo
Input: Really boring Output: bar
Input: Amazing performance Output: foo
Input: Contains no wit Output: bar
[...]
Input: Loved it Output:

41
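
A minimal sketch of how the shuffled, flipped, and semantically unrelated label settings above could be constructed from the same in-context examples; the prompt formatting is my own:

```python
import random

examples = [
    ("Great movie", "positive"),
    ("Really boring", "negative"),
    ("Amazing performance", "positive"),
    ("Contains no wit", "negative"),
]

def flip(label):
    return "negative" if label == "positive" else "positive"

def unrelated(label):
    return "foo" if label == "positive" else "bar"

def make_prompt(pairs, query):
    body = "\n".join(f"Input: {x} Output: {y}" for x, y in pairs)
    return f"{body}\nInput: {query} Output:"

query = "Loved it"
print(make_prompt(examples, query))                                  # original labels
print(make_prompt([(x, flip(y)) for x, y in examples], query))       # flipped labels
print(make_prompt([(x, unrelated(y)) for x, y in examples], query))  # semantically unrelated labels
# Shuffled labels: reassign labels at random so they carry no signal about the input.
shuffled = [(x, random.choice(["positive", "negative"])) for x, _ in examples]
print(make_prompt(shuffled, query))
```
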
Large models don’t need semantic labels to figure out the task

Small models have a large performance drop. They need the semantic labels.

Large models have a minimal performance drop. They can figure out the task even without semantic labels.

Larger language models do in-context learning differently. Wei et al., 2023.

42
Large LM intuition → General idea

(3) Words have different information densities, so give language models time to think. → Most data has varying information densities, so adaptive compute can help.

(4) Scaling model size and data is expected to continue improving loss. → Plot scaling curves to see if doing more of something will be a good strategy.

(5) Overall loss improves smoothly, but individual tasks can improve suddenly. → To better understand aggregate metrics, decompose them into individual categories.

43
Thanks.
X / Twitter: @_jasonwei

I’d love your feedback on this talk: https://tinyurl.com/jasonwei

44
