
On the Role of Large Language Models in Planning

Subbarao Kambhampati
Karthik Valmeekam, Lin Guan, Matthew Marquez

Slides @ bit.ly/3NC6vqs
Aim of the Tutorial
• Of late, there has been a significant rise in interest in using Large
Language Models in planning tasks
• In the last ~3 years, many papers have been published
• It started with training sequence learning models (specifically transformers)
tabula rasa on plans
• ..and has currently mostly become a sort of ersatz natural science of using huge
pre-trained models to see (and exploit) what planning abilities they may already
possess
• Much of the literature is in NLP and ML conferences.
• There is bidirectional ignorance.
• The authors of the papers don’t know the ICAPS community
• and the ICAPS community is largely unaware of the work on LLMs and Planning
• This tutorial is an attempt to rectify this ignorance, and to take a critical
look at the role of LLMs in Planning
• (..and should thus be of interest to both populations..)

4
Broad Lessons of the Tutorial..
• We will talk about both how LLMs can’t do the kinds of tasks we
(ICAPS folks) consider planning
• ..and how the availability of LLMs can help increase the range of planning tasks that can be supported (if only with partial guarantees)

5
A Summary of LLM roles in Planning we saw
(This is a summary slide put here just to foreshadow..)
• LLMs to generate plans autonomously
  • (Different prompting techniques, fine-tuning techniques)
  • Not that impressive if the plans require interaction resolution
• LLMs to generate plans with the help of external planners/verifiers (LLM-Modulo setting)
  • As heuristic seeds to external planners
  • As targets of “back prompts” by external verifiers
• LLMs to extract planning knowledge
  • Domain models
  • Task reduction schemas
• LLMs as translators (e.g. from English to PDDL)
  • Convert plans/recipes in English to formal representations (can be used for extracting structured knowledge)
  • Serve as a glorified natural language interface for specifying goals, problems etc..
6
A few caveats about the tutorial..
• This is not a dry chronological survey with a
laundry list of papers and their blurbs
• It is an opinionated perspective on the state of
LLMs and Planning intersection
• Informed by our own work in the area
• Caution: The authors of the papers brought up
in the tutorial may well bristle at the particular
perspective/pigeonholing of their work
• The aim is not to make up your mind, but to equip
you with a perspective that you may find useful
when you read the literature yourself (..or do work
in the area yourself..)

7
Expectations on the Tutorial Audience
• Some understanding of planning as practiced in
ICAPS is expected..
• After all, you are at ICAPS
• Not too much background on LLMs is expected
• The tutorial provides a perspective for viewing them..

8
A perspective on LLMs

9
LLMs are N-gram models on STEROIDS
• Text is a long sequence of words (including spaces and punctuation)
• An n-gram model of language learns to predict the n-th word given the preceding n-1 words
  • Probabilistically speaking, it learns Pr(Wn | W1...Wn-1)
  • A unigram model predicts each word independently (no preceding context)
  • A bigram model predicts each word given the previous word
  • A 3001-gram model learns to predict the next word given the previous 3000 words
  • ChatGPT is just a 3001-gram model
• The power of an n-gram model depends on
  • How much text it trains on
  • How big the n (context) is, and
  • How high-capacity the function learning Pr(Wn | W1...Wn-1) is
• ChatGPT trains on ~600 gigabytes of text on the Web
  • It learns a very high capacity function that has 175 billion parameters
  • It learns Pr(Wn | W1...Wn-1) for all possible n-th words Wn (the vocabulary of the language, ~50K words in English)
Training LLMs: “Wild Cats are but a bunch of wannabe Sun Devils.”
• Each prefix of the sentence is a training example
  • Wild ____
  • Wild Cats ____
  • Wild Cats are ____
  • ..
  • Wild Cats are but a bunch of wannabe Sun ____
• The LLM uses its current function to guess the next word
  • Guess: Geese (“Wild Geese”)    Correct: Cats
  • Error = {Cats - Geese}
  • To the LLM, all vocabulary tokens are just vectors in some high-dimensional embedding space; so the difference is well defined as the vector difference
• Propagate this error back through the function, and change the parameters so the error is reduced
  • Using back propagation (aka the Chain Rule of derivatives with dynamic programming); the basic workhorse of all neural networks
• <Go to the next example>
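
To make the above concrete, here is a minimal sketch (my illustration, not anything from the tutorial) of that training loop in PyTorch: every prefix of the sentence becomes a training example, the model guesses the next word, and the cross-entropy error is back-propagated to adjust the parameters. The toy vocabulary, sizes, and padding scheme are all assumptions made just for this sketch.

import torch
import torch.nn as nn

vocab = ["wild", "cats", "are", "but", "a", "bunch", "of", "wannabe", "sun", "devils", "geese"]
tok = {w: i for i, w in enumerate(vocab)}
sentence = ["wild", "cats", "are", "but", "a", "bunch", "of", "wannabe", "sun", "devils"]

CTX = 4  # a "5-gram" toy model: predict the next word from the previous 4

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=16, context=CTX):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)          # tokens become vectors in an embedding space
        self.out = nn.Linear(dim * context, vocab_size)   # (small) function approximating Pr(Wn | W1..Wn-1)

    def forward(self, prefix_ids):                        # prefix_ids: (batch, CTX)
        return self.out(self.emb(prefix_ids).flatten(1))  # unnormalized scores over the whole vocabulary

model = TinyLM(len(vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Each prefix of the sentence is one training example; the target is the actual next word.
for i in range(1, len(sentence)):
    prefix = sentence[max(0, i - CTX):i]
    prefix = ["wild"] * (CTX - len(prefix)) + prefix       # crude left-padding, just for this sketch
    x = torch.tensor([[tok[w] for w in prefix]])
    y = torch.tensor([tok[sentence[i]]])
    loss = loss_fn(model(x), y)                            # "error" between the guess (e.g. Geese) and Cats
    opt.zero_grad(); loss.backward(); opt.step()           # back-propagate and nudge the parameters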
..but the count table is Ginormous! (and is VERY sparse)
• With an n-gram model, you need to keep track of the conditional distributions for (n-1)-sized prefixes.
• With a vocabulary of size |V| (~50,000), there are |V|^(n-1) different prefixes!!
  • Easy for unigram (1 prefix), bigram (|V| prefixes) and trigram (|V|^2 prefixes)
  • For ChatGPT’s 3001-gram model, with a 50,000-word vocabulary, we are looking at a whopping (50,000)^3000 conditional distributions
  • (and most entries will be zero—as the chance of seeing the same 3000-word sequence again is vanishingly small!)
• What LLMs do is to essentially compress/approximate this ginormous count table with a function
  • That function, while high capacity (175 billion weights!), is still vanishingly small compared to the ginormous count ((50,000)^3000 >> 175 billion, or even a trillion!)
  • ..and oh by the way, the compressed function winds up having fewer zeros
    • It approximates both the non-zero counts and the zero counts, so..
    • GENERALIZATION!!!
  • In essence the function learns to “abstract” and “cluster” over “similar” sequences

[Side note: Transformers are a (not particularly principled) parallelization of recurrent neural networks.]
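
A quick back-of-the-envelope check of the “ginormous table” point (my own arithmetic, not from the slides): the number of (n-1)-word prefixes a 3001-gram model over a 50,000-word vocabulary would have to track, versus the roughly 175 billion parameters that approximate the table.

import math

V, n = 50_000, 3001
log10_prefixes = (n - 1) * math.log10(V)                          # log10 of |V|^(n-1)
print(f"|V|^(n-1) is about 10^{log10_prefixes:.0f} prefixes")     # ~10^14097
print(f"parameter count is about 10^{math.log10(175e9):.1f}")     # ~10^11.2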
Superhuman training
• ChatGPT trains on ~600 gigabytes of text on
the Web (~60 million pages of text)
• This is text that we wrote and uploaded for our
consumption—and not for ChatGPT!
• It learns a very high capacity function that
has 175 billion parameters
• Learns Pr(Wn |W1...Wn-1) for all possible nth
words Wn (Vocabulary of the language, ~50K in
English)
• Requires extreme compute facilities (GPU
clusters) to learn the function
So ChatGPT is just completing your prompt by
repeatedly predicting the next word given the
previous 3000 words
• But, the function it learns to predict the next word is a very high capacity one (with 175 billion parameters for ChatGPT and over a trillion for GPT-4)
• This function is learned by analyzing ~600 GB of text
  • The learning phase is very time consuming (and is feasible only because of the extreme computational power utilized)
• And all conversation—whether everyday or deeply philosophical—is, at some level, completing the prompt (saying words in the context window of other words that have already been said!)
• Thus it is that ChatGPT can “converse” with you on any subject!
  • Really?
LLMs look at everything we say as a prompt to be completed..
Whether we think we are asking questions, pouring our hearts out, or just talking to them,
LLMs see what we say as text prompts to be completed
• Write an essay on the origins and impacts of Jim Crow
• Write a poem on the Cow in the style of Shakespeare
• Why did the Silicon Valley Bank fail?
• Explain all the ways Wild Cats envy Sun Devils
• Write some TikZ code to produce a sketch of a unicorn..

If there is “meaning” in these completions—facts, humor, pathos—it is in our heads!


AI as an Ersatz Natural Science

But how can these prompt-completion beasts generate such coherent, plausible text that also seems SO right sometimes?

Answer: MAGIC..!
Some possible factors:
→ Almost everything we know is also already on the web (and is fodder for LLM training)
→ Completion over large (3000-word) context windows can be more directed (low-entropy) than we have intuitions about. (This is not a 3-gram model completing “left and … ”)
Standard ways to improve LLM responses

Prompting (“in-context learning”) — doesn’t change LLM parameters
• If you don’t like what an LLM is giving as an answer to your prompt, you can add additional prompts
• The LLM will then take the new context window (including what it said and what you said) to predict the next sequence of words/tokens
  • Every word in the context window—including the ones the LLM added—is changing the conditional distribution with which the next token is being predicted
  • Note that all these conditional distributions have been precomputed!
  • Nothing inside the LLM is changing because of your prompts
• The undeniable attraction of “prompting” is that it is natural for us! It is sort of how we interact with each other!
  • There is a whole cottage industry on the “art” of good prompting
  • “How to ask LLMs nicely?”
• If you give k examples of good answers as part of the prompt, it is called “k-shot in-context learning”

Fine Tuning — changes LLM parameters
• Fine-tune the parameters of a pre-trained LLM by making it look specifically at the data of interest to you
  • Give it lots of plan sequences, so it learns better conditional distributions for predicting plans
  • Use labeled <prompt, response> pairs to make its responses “more palatable”
  • Use supervised techniques or RL techniques to improve the parameters to be more consistent with the fine-tuning data
  • [There is also evidence that big companies use more “polished”/“annotated” data during the fine-tuning phase—including paying humans to generate data adorned with derivational information—which is often not included in the web text]
• Because fine tuning is changing the parameters of the LLM, while its performance on the specific task (be a better planner, be less offensive) may improve, it also changes its performance on other tasks in unpredictable ways
  • Microsoft claims that GPT-4 had more AGI sparks before it was lobotomized with RLHF to be less offensive!
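
As a concrete illustration of the “k-shot in-context learning” bullet, here is a minimal sketch (mine, not from the tutorial) of how such a prompt is typically assembled: the k worked examples are simply prepended to the query, and nothing inside the LLM changes. `call_llm` stands in for whatever completion API is used and is purely hypothetical.

def build_k_shot_prompt(instructions: str, examples: list[tuple[str, str]], query: str) -> str:
    # Each (problem, plan) example is rendered in the same format the query will use
    shots = "\n\n".join(f"[STATEMENT]\n{p}\n[PLAN]\n{r}\n[PLAN END]" for p, r in examples)
    return f"{instructions}\n\n{shots}\n\n[STATEMENT]\n{query}\n[PLAN]\n"

# prompt = build_k_shot_prompt(domain_description, [(example_problem, example_plan)], new_problem)
# plan_text = call_llm(prompt)   # hypothetical LLM call; in-context learning leaves the weights untouched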
Our Poor Intuitions about Approximate Omniscience make it hard to tell whether LLMs are reasoning or retrieving..
• It is worth understanding that our intuitions about what exactly is in the 600 GB of text on the web are very poor.
  • One of the big surprises when Google came out with the PaLM LLM was that it could "explain" jokes
  • But did you know that there are sites on the web that explain jokes (..and movie endings and book plots etc., etc.)?
  • If you are not surprised at someone answering a question by "googling" it, you probably shouldn't be too impressed by an LLM answering it..
• This means that we are not good at guessing whether LLMs came to an answer mostly by approximate retrieval or by first-principles reasoning
  • In the case of "reasoning" tasks, we may consider that an LLM was able to reach a conclusion by something akin to theorem proving from base facts
  • But then we are missing the simple fact that the linguistic knowledge on the web not only contains "facts" and "rules" but also chunks of the deductive closure of these facts/rules.
  • In general, memory reduces the need to reason from first principles..
  • Which is why it is no longer worth being surprised at people answering the "why are manhole covers round?" question!

18
So, can LLMs Plan?

19
But they seem to fall flat on
the lowly 3-block stacking
problem
(that we affectionately call
the Sussman Anomaly)

20

Will GPT4’s AGI Sparks help?


Using LLMs to Generate Plans Autonomously

29
Results on GPT-4

Domain                             Method      GPT-4              Instruct-GPT3
Blocksworld                        One-shot    206/600 (34.3%)    41/600 (6.8%)
Blocksworld                        Zero-shot   208/600 (34.6%)    -
Mystery Blocksworld (Deceptive)    One-shot    16/600 (2.6%)      7/600 (1.1%)
Mystery Blocksworld (Deceptive)    Zero-shot   1/600 (0.16%)      -
Mystery Blocksworld (Randomized)   One-shot    11/600 (1.8%)      5/600 (0.8%)
Mystery Blocksworld (Randomized)   Zero-shot   0/600 (0%)         -
Plan Generation Results
Plan Generation Prompt - Blocksworld

Domain description:
I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do

Pick up a block
Unstack a block from on top of another block
Put down a block
Stack a block on top of another block

I have the following restrictions on my actions:
I can only pick up or unstack one block at a time.
I can only pick up or unstack a block if my hand is empty.
I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up.
I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block.
I can only unstack a block from on top of another block if the block I am unstacking is clear.
Once I pick up or unstack a block, I am holding the block.
I can only put down a block that I am holding.
I can only stack a block on top of another block if I am holding the block being stacked.
I can only stack a block on top of another block if the block onto which I am stacking the block is clear.
Once I put down or stack a block, my hand becomes empty.
Once you stack a block on top of a second block, the second block is no longer clear.

Example:
[STATEMENT]
As initial conditions I have that, the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table.
My goal is to have that the orange block is on top of the blue block.
My plan is as follows:

[PLAN]
unstack the blue block from on top of the orange block
put down the blue block
pick up the orange block
stack the orange block on top of the blue block
[PLAN END]

Query Instance:
[STATEMENT]
As initial conditions I have that, the red block is clear, the yellow block is clear, the hand is empty, the red block is on top of the blue block, the yellow block is on top of the orange block, the blue block is on the table and the orange block is on the table.
My goal is to have that the orange block is on top of the red block.
My plan is as follows:

[PLAN]

GPT-4 Response:
unstack the yellow block from on top of the orange block
put down the yellow block
unstack the red block from on top of the blue block
put down the red block
pick up the orange block
stack the orange block on top of the red block
[PLAN END]

SUCCESS
Human Baseline for Blocksworld
• Preliminary study – 50 participants
• Asked to come up with a plan for one instance (from the set of 500 instances)
• Two phases of interaction
  • Plan writing phase – participants write up plans
  • Plan translation phase – participants translate already written plans
  • First for an example, then the actual instance
• Of the 50 human planners, 39 (78%) came up with a valid plan, 35 (70%) of whom came up with the optimal plan

[Pie charts: Plan Writing and Plan Translation phases – Invalid 22%, Optimal 70%, Sub-optimal 8%]
33
Are LLMs retrieving based on
names or are they reasoning?
What if GPT4 is basically bringing to bear its background knowledge
about blocks world instead of just depending on the domain model?

34
Mystery Blocksworld domain

Original Blocksworld:
I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do

Pick up a block
Unstack a block from on top of another block
Put down a block
Stack a block on top of another block

I have the following restrictions on my actions:
I can only pick up or unstack one block at a time.
I can only pick up or unstack a block if my hand is empty.
I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up.
I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block.
I can only unstack a block from on top of another block if the block I am unstacking is clear.
Once I pick up or unstack a block, I am holding the block.
I can only put down a block that I am holding.
I can only stack a block on top of another block if I am holding the block being stacked.
I can only stack a block on top of another block if the block onto which I am stacking the block is clear.
Once I put down or stack a block, my hand becomes empty.

Mystery Blocksworld:
I am playing with a set of objects. Here are the actions I can do

Attack object
Feast object from another object
Succumb object
Overcome object from another object

I have the following restrictions on my actions:
To perform Attack action, the following facts need to be true: Province object, Planet object, Harmony
Once Attack action is performed the following facts will be true: Pain object
Once Attack action is performed the following facts will be false: Province object, Planet object, Harmony
To perform Succumb action, the following facts need to be true: Pain object
Once Succumb action is performed the following facts will be true: Province object, Planet object, Harmony
Once Succumb action is performed the following facts will be false: Pain object.
To perform Overcome action, the following needs to be true: Province other object, Pain object
Once Overcome action is performed the following will be true: Harmony, Pain object, Object Craves other object
Once Overcome action is performed the following will be false: Province other object, Pain object
To perform Feast action, the following needs to be true: Object Craves other object, Province object, Harmony.
Once Feast action is performed the following will be true: Pain object, Province other object
Once Feast action is performed the following will be false: Object Craves other object, Province object, Harmony

Results on GPT-4

Domain                             Method      GPT-4              Instruct-GPT3
Blocksworld                        One-shot    206/600 (34.3%)    41/600 (6.8%)
Blocksworld                        Zero-shot   208/600 (34.6%)    -
Mystery Blocksworld (Deceptive)    One-shot    16/600 (2.6%)      7/600 (1.1%)
Mystery Blocksworld (Deceptive)    Zero-shot   1/600 (0.16%)      -
Mystery Blocksworld (Randomized)   One-shot    11/600 (1.8%)      5/600 (0.8%)
Mystery Blocksworld (Randomized)   Zero-shot   0/600 (0%)         -
Plan Generation Results on Mystery BW
Can Alternate Prompting
Strategies Help?
Prompting directly in PDDL
(after all, LLMs are not confined to actual natural language!)

Prompting without domain information

40
Plan Generation PDDL Prompt - Blocksworld

Here is a pddl domain, an example problem and its corresponding plan.
Provide the plan for the query problem. Provide only the pddl syntax for the plan.

[DOMAIN]
(define (domain blocksworld-4ops)
  (:requirements :strips)
  (:predicates (clear ?x)
               (ontable ?x)
               (handempty)
               (holding ?x)
               (on ?x ?y))

  (:action pick-up
    :parameters (?ob)
    :precondition (and (clear ?ob) (ontable ?ob) (handempty))
    :effect (and (holding ?ob) (not (clear ?ob)) (not (ontable ?ob)) (not (handempty))))

  (:action put-down
    :parameters (?ob)
    :precondition (holding ?ob)
    :effect (and (clear ?ob) (handempty) (ontable ?ob) (not (holding ?ob))))

  (:action stack
    :parameters (?ob ?underob)
    :precondition (and (clear ?underob) (holding ?ob))
    :effect (and (handempty) (clear ?ob) (on ?ob ?underob) (not (clear ?underob)) (not (holding ?ob))))

  (:action unstack
    :parameters (?ob ?underob)
    :precondition (and (on ?ob ?underob) (clear ?ob) (handempty))
    :effect (and (holding ?ob) (clear ?underob) (not (on ?ob ?underob)) (not (clear ?ob)) (not (handempty)))))

[PROBLEM]
(define (problem BW-rand-4)
  (:domain blocksworld-4ops)
  (:objects a b c d)
  (:init (handempty) (ontable a) (on b c) (ontable c) (ontable d) (clear a) (clear b) (clear d))
  (:goal (and (on c b))))

[PLAN]
(unstack b c)
(put-down b)
(pick-up c)
(stack c b)
[PLAN_END]

[QUERY PROBLEM]
(define (problem BW-rand-4)
  (:domain blocksworld-4ops)
  (:objects a b c d)
  (:init (handempty) (on a b) (ontable b) (ontable c) (on d c) (clear a) (clear d))
  (:goal (and (on c a))))

[PLAN]

GPT-4 Response:
[PLAN]
(unstack a b)
(put-down a)
(pick-up c)
(stack c a)
[PLAN_END]

SUCCESS
Plan Generation Results – PDDL prompts
Do we have to include the domain description
in the prompting?
• Including the domain model in the prompt provides a more natural way to ”tell” the LLM to only generate plans with actions from the domain model.
• Prompting LLMs with just the problem
  • Issue: the actions output by the LLM may not correspond to any actual actions that the executor has access to
  • Two ways to "solve" the problem
    • Just look for an action "close"/"similar" in name (either in English or in embedding space)
      • No guarantee of actual connection
    • Manipulate the next token (action) generation
      • Look for an action (name) that the executor has access to (basically, SayCan)

43
Learning to Correct LLM Suggested Actions
• Huang et al*
• Evaluation of LLMs as zero shot planners
done in a household setting (Virtual Home;
Open-ended tasks)
• Metrics
• Executability: Whether an action satisfies the
constraints of the environment
• Correctness: How similar the generated plan is
to a human annotation for the task
• A pre-trained BERT-style LM is used to semantically translate free-form LLM outputs to an admissible action
  • Cosine similarity in the pre-trained embedding space is used as the distance metric

*Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022, June). Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning (pp. 9118-9147). PMLR.

44
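
A minimal sketch of that translation step (my rendering of the idea, not the paper's code): embed the LLM's free-form action and all admissible actions, then pick the nearest admissible action by cosine similarity. The sentence-transformers model name is just one common choice, and `admissible_actions` is a hypothetical list supplied by the environment.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def to_admissible(llm_action: str, admissible_actions: list[str]) -> str:
    # Embed the free-form output and all admissible action descriptions
    query_emb = encoder.encode(llm_action, convert_to_tensor=True)
    action_embs = encoder.encode(admissible_actions, convert_to_tensor=True)
    # Cosine similarity in the embedding space is the distance metric
    sims = util.cos_sim(query_emb, action_embs)[0]
    return admissible_actions[int(sims.argmax())]

# to_admissible("grab the mug from the counter", ["pick up mug", "open fridge", "walk to counter"])
# -> "pick up mug"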
Modifying the LLM's Next Action Generation (SayCan)
• SayCan – LLMs are used as heuristics
• Given:
  • A high-level instruction i
  • A set of low-level skills Π
  • A language description l_π for each of the skills π ∈ Π
• LLM – provides p(l_π | i), the probability that l_π makes progress towards i
  • LLMs score a set of pre-determined skills
  • The scoring is done by accessing the inner log probabilities generated by the LLM
• Robot – provides p(c_π | s, l_π), the probability that the skill π described by l_π can be completed in state s

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., ... & Zeng, A. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
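
A small sketch of SayCan-style skill selection as described above (my paraphrase, not the authors' code): combine the LLM's usefulness score p(l_π | i) with the robot's affordance score p(c_π | s, l_π) and pick the skill that maximizes the product. `llm_logprob` and `affordance` are hypothetical callables standing in for the LLM scoring and the robot's value functions.

import math
from typing import Callable

def saycan_pick_skill(instruction: str,
                      state,
                      skills: list[str],
                      llm_logprob: Callable[[str, str], float],
                      affordance: Callable[[object, str], float]) -> str:
    best_skill, best_score = None, -math.inf
    for skill_desc in skills:
        p_useful = math.exp(llm_logprob(instruction, skill_desc))  # p(l_pi | i) from LLM token log-probs
        p_doable = affordance(state, skill_desc)                   # p(c_pi | s, l_pi) from the robot
        score = p_useful * p_doable                                 # combined SayCan score
        if score > best_score:
            best_skill, best_score = skill_desc, score
    return best_skill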
Will LLMs do better if they do "step-by-step" plan generation?
• One of the claims in the LLM planning and reasoning papers has been that LLMs can do better if they do "step by step" reasoning.
• Will they maybe do better if they are "asked" to track the evolution of the plan?
  • This can basically be incremental checking of a proof of correctness
  • Let's start with a progression proof, which involves tracking the state
• The prompt contains:
  • The domain information
  • A meta-prompt explaining plan correctness
  • An annotated example where each action is annotated with
    • The state prior to the action
    • The reason why the action is applicable in the prior state
    • The resulting state after applying the action
  • The query instance
• The LLM is then asked to return a response making the same state tracking and justification annotations that were included in the example.
State Tracking Example

Domain:
I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do
Pick up a block
Unstack a block from on top of another block
Put down a block
Stack a block on top of another block

I have the following restrictions on my actions:
I can only pick up or unstack one block at a time.
I can only pick up or unstack a block if my hand is empty.
I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up.
I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block.
I can only unstack a block from on top of another block if the block I am unstacking is clear.
Once I pick up or unstack a block, I am holding the block.
I can only put down a block that I am holding.
I can only stack a block on top of another block if I am holding the block being stacked.
I can only stack a block on top of another block if the block onto which I am stacking the block is clear.
Once I put down or stack a block, my hand becomes empty.
Once you stack a block on top of a second block, the second block is no longer clear.

Meta prompt:
The plan correctness is defined in terms of states resulting from executing the actions in the plan. An action is executable in a state when all its preconditions hold in that state. The state resulting from the action execution consists of everything in the previous state with the addition and deletion of add and delete effects of the action. Plan correctness is defined as follows: if the first action in the plan is applicable in the initial state, i.e., its preconditions are all present there; and the second action is applicable in the state resulting from applying the first action to the initial state, this process continues until the state resulting from the application of the last action in the last but one state gives rise to the final state where all the goals are satisfied.

Annotated Example:
[STATEMENT]
As initial conditions I have that, the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table
My goal is to have that the orange block is on top of the blue block.
My plan is as follows:

[PLAN]
1. Current State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table
Action: unstack the blue block from on top of the orange block
Reason: The above action is applicable in the current state because its preconditions; the blue block is clear, the hand is empty and the blue block is on top of the orange block, are satisfied in the current state.
Resulting State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is currently holding blue block, the red block is on the table, the orange block is on the table and the yellow block is on the table

2. Current State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is currently holding blue block, the red block is on the table, the orange block is on the table and the yellow block is on the table
Action: put down the blue block
Reason: The above action is applicable in the current state because its preconditions; the hand is currently holding blue block, are satisfied in the current state.
Resulting State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table and the yellow block is on the table

3. Current State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table and the yellow block is on the table
Action: pick up the orange block
Reason: The above action is applicable in the current state because its preconditions; the orange block is clear, the hand is empty and the orange block is on the table, are satisfied in the current state.
Resulting State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is currently holding orange block, the red block is on the table, the blue block is on the table and the yellow block is on the table

4. Current State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is currently holding orange block, the red block is on the table, the blue block is on the table and the yellow block is on the table
Action: stack the orange block on top of the blue block
Reason: The above action is applicable in the current state because its preconditions; the blue block is clear and the hand is currently holding orange block, are satisfied in the current state.
Resulting State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the orange block is on top of the blue block, the red block is on the table, the blue block is on the table and the yellow block is on the table

Final State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the orange block is on top of the blue block, the red block is on the table, the blue block is on the table and the yellow block is on the table
The goal conditions are satisfied in the final state. Hence, the above plan is valid.
[PLAN END]
State Tracking Example

Query Instance:
[STATEMENT]
As initial conditions I have that, the red block is clear, the yellow block is clear, the hand is empty, the red block is on top of the blue block, the yellow block is on top of the orange block, the blue block is on the table and the orange block is on the table
My goal is to have that the orange block is on top of the red block.
My plan is as follows:

GPT-4 Response:
[PLAN]
1. Current State: the red block is clear, the yellow block is clear, the hand is empty, the red block is on top of the blue block, the yellow block is on top of the orange block, the blue block is on the table, and the orange block is on the table
Action: unstack the red block from on top of the blue block
Reason: The above action is applicable in the current state because its preconditions; the red block is clear, the hand is empty, and the red block is on top of the blue block, are satisfied in the current state.
Resulting State: the blue block is clear, the yellow block is clear, the hand is currently holding red block, the yellow block is on top of the orange block, the blue block is on the table, and the orange block is on the table

2. Current State: the blue block is clear, the yellow block is clear, the hand is currently holding red block, the yellow block is on top of the orange block, the blue block is on the table, and the orange block is on the table
Action: put down the red block
Reason: The above action is applicable in the current state because the precondition; the hand is currently holding red block, is satisfied in the current state.
Resulting State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the yellow block is on top of the orange block, the blue block is on the table, and the orange block is on the table

3. Current State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the yellow block is on top of the orange block, the blue block is on the table, and the orange block is on the table
Action: unstack the yellow block from on top of the orange block
Reason: The above action is applicable in the current state because its preconditions; the yellow block is clear, the hand is empty, and the yellow block is on top of the orange block, are satisfied in the current state.
Resulting State: the red block is clear, the blue block is clear, the orange block is clear, the hand is currently holding yellow block, the red block is on the table, the blue block is on the table, and the orange block is on the table

4. Current State: the red block is clear, the blue block is clear, the orange block is clear, the hand is currently holding yellow block, the red block is on the table, the blue block is on the table, and the orange block is on the table
Action: put down the yellow block
Reason: The above action is applicable in the current state because the precondition; the hand is currently holding yellow block, is satisfied in the current state.
Resulting State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table, and the yellow block is on the table

5. Current State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table, and the yellow block is on the table
Action: pick up the orange block
Reason: The above action is applicable in the current state because its preconditions; the orange block is clear, the hand is empty, and the orange block is on the table, are satisfied in the current state.
Resulting State: the red block is clear, the blue block is clear, the hand is currently holding orange block, the yellow block is clear, the red block is on the table, the blue block is on the table, and the yellow block is on the table

6. Current State: the red block is clear, the blue block is clear, the hand is currently holding orange block, the yellow block is clear, the red block is on the table, the blue block is on the table, and the yellow block is on the table
Action: stack the orange block on top of the red block
Reason: The above action is applicable in the current state because its preconditions; the red block is clear, and the hand is currently holding orange block, are satisfied in the current state.
Resulting State: the blue block is clear, the orange block is clear, the hand is empty, the yellow block is clear, the orange block is on top of the red block, the blue block is on the table, the red block is on the table, and the yellow block is on the table

Final State: the blue block is clear, the orange block is clear, the hand is empty, the yellow block is clear, the orange block is on top of the red block, the blue block is on the table, the red block is on the table, and the yellow block is on the table
The goal conditions are satisfied in the final state. Hence, the above plan is valid.
[PLAN END]

SUCCESS
State Tracking Results
What if we finetuned LLMs with successful plans in the domain?
• What if we further finetuned the next word (action) completer with a bunch of correct plans in the domain?
  • This is basically the supervised finetuning stage LLMs currently use to make them better at specific domains (e.g. Bloomberg's FinGPT..)
• We prepared a dataset comprising the initial state, goal state, and the respective plan for 1,000 distinct Blocksworld instances.
• By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process.
• Finetuned GPT-3 could only solve around 20% (122 out of 600) of the test set.

[Pie chart: Instances correct by Finetuned GPT-3 – 20% correct, 80% incorrect]
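
For concreteness, a sketch of how such a fine-tuning corpus is typically laid out (an assumption on my part, not the tutorial's actual script): one <prompt, completion> record per Blocksworld instance, with the initial state and goal as the prompt and the known-correct plan as the completion, in the classic OpenAI fine-tuning JSONL format.

import json

def make_finetune_record(init_desc: str, goal_desc: str, plan_steps: list[str]) -> str:
    prompt = (f"[STATEMENT]\nAs initial conditions I have that, {init_desc}\n"
              f"My goal is to have that {goal_desc}\nMy plan is as follows:\n[PLAN]\n")
    completion = "\n".join(plan_steps) + "\n[PLAN END]"
    return json.dumps({"prompt": prompt, "completion": completion})

# One record per instance; ~1,000 of these, split 80-20 into train/validation files.
# with open("bw_train.jsonl", "w") as f:
#     for inst in instances:   # `instances` is a hypothetical iterable of solved BW problems
#         f.write(make_finetune_record(inst.init, inst.goal, inst.plan) + "\n")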
Doing even more fine tuning?
• Pallagani et al.*
  • Fine-tune Code-T5 using a dataset of 18,000 plans for each domain
  • Change the representation of the planning problem into a more compact representation for finetuning
  • Provide PDDL-style prompts to pre-trained models and the compact representation to the fine-tuned models

[Table: Evaluation of plan generation capabilities of LLMs (both prompting the pre-trained model and the fine-tuned model)]

When does it become quixotic to give a huge number of plans instead of a small domain model?

*Pallagani, V., Muppasani, B., Murugesan, K., Rossi, F., Srivastava, B., Horesh, L., & Loreggia, A. (2023). Understanding the Capabilities of Large Language Models for Automated Planning. arXiv preprint arXiv:2305.16151.
Training Sequence models (e.g.
Transformers) Tabula Rasa
(without pre-training them on language)

52
Action Vector Models can be used to Recognize Plans
• With the learnt vectors w_i, we can predict the target plan (as the most consistent with the affinities). We use an EM procedure to speed up the prediction.
• M = |the target plan|
• The target plan to be recognized
• Learning shallow models can avoid overfitting!!

[Figure: plan recognition accuracy of DUP vs. ARMS+PRP on three domains—(a) blocks, (b) depots, (c) driverlog—as the percentage of unobserved actions varies from 0.05 to 0.25]

Nominated for Best Student Paper Award at [AAMAS16]
Training Transformers on Plan Sequences
• Uses the autoregressive GPT
transformer architecture to predict
the next token in a reward, state,
action sequence
• Target reward and initial state are
provided as first tokens during
prediction
• Work focuses mainly on action
prediction and did not find improved
performance with reward and state
prediction (although it is possible)
"Decision transformer: Reinforcement learning via sequence modeling." Advances in neural information processing
systems 34 (2021): 15084-15097. Figures are from that paper.
57
Looks Like we showed that
LLMs can’t Plan..?

59
60
On the other hand, the literature seems rife
with claims of LLM planning abilities..

61
62

Why this divide?


Answer: Misunderstandings about what planning involves
64

What Planning is & What LLMs are good at..

Planning (as used in common parlance) involves
• Planning knowledge
  • Actions, preconditions and effects
  • General Recipes: Task reduction schemata (e.g. HTN planning)
  • Old examples: Case libraries
• Plan generation/verification techniques
  • Interaction analysis/resolution
  • Plan merging techniques
  • Plan modification techniques

Contrasting what AI Planning & LLMs bring to the table
• AI Planning (aka ICAPS planning) assumes that the planning knowledge is given up front, and focuses on generation and verification techniques
  • Emphasis on guaranteeing completeness/correctness of the plans w.r.t. the model
  • By and large the common paradigm—although there have been occasional mutinies
    • Model-Lite Planning approaches
• LLMs, trained as they are on everything ever put on the web, have a kind of "approximate omniscience". This helps them spit out actions, recipes, or cases
  • But they lack the ability to stitch the recipes together to ensure that the result is actually interaction free!

LLMs accept any planning problem—even if it is not expressible in the PDDL standard—and they don’t give any correctness guarantees.
AI Planners will give formal guarantees, but only accept problems expressible in their language.
Are LLMs better at planning if there are no
subgoal interactions?
• Relaxed assessment of GPT-4 plans
• Delete relaxation – Ignoring the delete conditions of all actions
• Precondition relaxation – Ignoring the preconditions of all actions
• Even in the most lenient assessment mode (Delete+Precondition relaxation)
there are still plans (~25%) that are not goal reaching.

65
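
A tiny sketch (my illustration) of what the relaxed assessments above mean operationally: simulate the plan, but ignore delete effects for the delete relaxation and skip the precondition check for the precondition relaxation. Actions are assumed to be (preconditions, add effects, delete effects) triples of sets of ground facts.

def goal_reaching(plan, init, goal, check_preconds=True, apply_deletes=True):
    state = set(init)
    for preconds, adds, deletes in plan:
        if check_preconds and not preconds <= state:
            return False                       # plan not even executable
        if apply_deletes:
            state -= deletes                   # normal semantics
        state |= adds                          # delete relaxation keeps everything ever added
    return goal <= state

# goal_reaching(plan, init, goal)                                              -> strict assessment
# goal_reaching(plan, init, goal, apply_deletes=False)                         -> delete relaxation
# goal_reaching(plan, init, goal, check_preconds=False, apply_deletes=False)   -> most lenient mode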
Then how come LLMs are trumpeted as doing planning?
• Most cases where LLMs are claimed to generate executable plans, on closer examination, turn out to be cases where LLMs are getting by with the "generate approximate recipes" step
  • Generate approximate recipes/cases (for common sense domains)
    • e.g. wedding plans
  • Convert tasks into (approximate) task reduction schemas
    • Perhaps written out as "programs" (e.g. Code as Policies..)
    • (SHOP2 schemas were already pseudo lisp code—if only written by humans)
  • LLM-HTN and LLM-CBR differ from HTN and CBR in that they generate the task-reduction schemas or the cases on demand
• And the interaction resolution/search part is
  • either pushed under the rug
    • Consider "high level" plans like "wedding plans", for which there are enough generic recipes available in the training set, and which are described at a sufficiently high level of abstraction that the execution issues are left to the user's imagination
  • or has been pawned off to human prompters who are required to give "hints" to the LLM to come up with plan variants that are (more) correct
    • Note that here the human is essentially playing the role of an external verifier & critic
    • In cases where the humans are end users not well versed with all details of the domain, they can be faulty verifiers
66
67

ReAct, Inner Monologue, ToT Examples

[Figures: example traces from Tree of Thoughts, ReAct, and Inner Monologue]

Most of the ‘planning’ problems that these works look at don’t require interaction resolution (no interactions between subgoals).

What if We Allowed LLMs to Interact with External Planners/Verifiers?

68
Expressivity of LLMs is different from
that of LLM Modulo Solvers (i.e.,
LLMs with external plugins)
• LLMs by themselves are broad and shallow
• Cannot give guarantees about their reasoning
• But, they can improve their reasoning etc.
capabilities by making calls to external
solvers
• That themselves are deep and narrow
• The very current “AutoGPT” excitement is
basically about this..
• (Ensuring how a shallow manager can still
provide guarantees via the narrow/expert
workers is a fertile direction of research)
69
Workflow for using LLMs as Idea Generators (for External Sound Planners)

LLMs have universal high recall (they will never shut up!), but questionable precision
Automated Planners are guaranteed correct, but only for planning problems that they can handle

70
LLMs as heuristics to sound planners
• LLM generated plans might not be directly useful.
• But the plans can be used as heuristic guidance to drive sound planners.
• LPG [1]
  • A local search planner
  • Starts with a seed plan and repairs it iteratively till it gets a correct plan.
• Use LLM generated plans as seed plans to LPG
• LPG can ‘repair’ these plans into correct ones (faster than it would if it started with an empty plan)

[Figure: LLM generated plan fed as a heuristic seed to a sound planner like LPG]

[1] Gerevini, Alfonso, and Ivan Serina. "LPG: A Planner Based on Local Search for Planning Graphs with Action Costs." AIPS. Vol. 2. 2002.
71
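
A rough sketch of that workflow (assumptions flagged): get a candidate plan from the LLM, write it to a file, and hand it to the local-search planner as a seed to repair. `call_llm` is a hypothetical LLM helper, and the command-line flags shown are illustrative only; the real invocation should follow the LPG documentation.

import subprocess

def llm_seeded_planning(domain_pddl: str, problem_pddl: str, prompt: str) -> str:
    seed_plan = call_llm(prompt)               # LLM's (possibly incorrect) candidate plan
    with open("seed_plan.txt", "w") as f:
        f.write(seed_plan)
    # Hand the seed plan to the local-search planner; it repairs the plan until it is correct.
    result = subprocess.run(
        ["lpg", "-o", domain_pddl, "-f", problem_pddl, "-input_plan", "seed_plan.txt"],  # illustrative flags
        capture_output=True, text=True)
    return result.stdout                       # repaired, sound plan (if one is found)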
Connection to Case based Planning
• Note that there is an interesting
parallel between this and case based
planning systems—which retrieve an
old plan most relevant to the current
problem and try to modify the plan
• Modification by domain-specific rules
[e.g. CHEF]
• Modification by domain-independent
planners [e.g. PRIAR]

72
Back-prompting from External Verifiers

LLMs have universal high recall (they will never shut up!), but questionable precision
Automated Verifiers are guaranteed correct, but only for planning problems that they can handle

73
Automated Back-Prompting with External Verifiers
• Preliminary experiments show that back-prompting does improve LLMs' ability to produce plans in Blocks World and Logistics
  • On average over ~4 feedback rounds
• The performance in the Mystery BW still doesn't improve, showing that the connection to commonsense domains/terms is critical for LLMs to fake planning
75
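
Schematically, the back-prompting loop looks like the sketch below (my rendering of the idea, not the authors' code): the LLM proposes a plan, an external verifier such as VAL checks it against the PDDL domain, and the error report is appended to the prompt for another round. `call_llm` and `validate_with_val` are hypothetical helpers wrapping the LLM API and the VAL executable.

def backprompt_plan(domain_pddl: str, problem_pddl: str, base_prompt: str, max_rounds: int = 4):
    prompt = base_prompt
    for _ in range(max_rounds):
        plan = call_llm(prompt)                                   # LLM proposes a candidate plan
        ok, error_report = validate_with_val(domain_pddl, problem_pddl, plan)
        if ok:
            return plan                                           # verifier certifies the plan
        # Back-prompt: feed the verifier's critique back into the context window
        prompt += (f"\nThe plan you gave is invalid:\n{error_report}\n"
                   "Please provide a corrected plan.\n")
    return None                                                   # give up after max_rounds feedback rounds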
LLMs Assisting Human Planners
• Similar study with two separate groups of participants
  • With LLM assistance – the LLM plan was presented as a plan suggestion
  • Without LLM assistance
• With LLM assistance: 48 human planners, 33 (~69%) of them came up with a valid plan.
• Without LLM assistance: 49 human planners, 39 (~80%) of them came up with a valid plan.

[Pie charts: With LLM assistance – 69% valid, 31% invalid; Without LLM assistance – 80% valid, 20% invalid]
[Figure: Interface at the plan writing phase with assistance from the LLM]

No statistical significance in the accuracy, time taken, or cognitive load between the two groups

76
But, if LLMs can "guess" reasonable
solutions, can't we just call them
multiple times until they generate
correct solution?

Yes, if there is a way to verify and critique the LLM Plans


(..and things get interesting depending on who the source
of verification/critiquing is going to be..)
78
Humans Critiquing LLM Plans
(..and the Clever Hans peril..)
• Humans doing the verification & giving helpful prompts to the LLM
• Okay when the humans know the domain and can
correct the plan (with some guarantees)
• Okay for "this essay looks good enough" kind of critiquing
• But for planning, with end users not aware of the domain
physics, the plans that humans are happy with may still not
be actually executable
• When humans know the correct answer (plan) there
is also the very significant possibility of Clever Hans
effect
• Humans unwittingly/unknowingly/non-deliberately giving
important hints

79
80
81
Mechanized Critiquing of LLM plans
• External model-based verifiers doing the critiquing
• A verifier working off of a separate (independently certified) domain model
• VAL for PDDL models
• (Variation) A simulator that can verify/critique the LLM plan
• (Remember: Simulators have to be separately engineered just like domain models)
• Critics can be sound but not complete (special purpose critics)
• Interestingly, the original NOAH planner (1976!!) viewed planning as an iterative criticism of the candidate plan
by a set of sound but incomplete critics (each of which were also expected to give resolutions for the flaws they
found—and they weren't always complete in specifying the resolutions..)
• Much of the later work on domain-independent planning basically systematized this process

• "Feedback from the world"
  • If it is really the world, red in tooth and claw, it could be non-ergodic (failure ends the agent)
  • Also, the world doesn't provide the helpful critiques (that VAL or simulators provide!)
• LLMs verifying their own plans (??)
• Many questionable claims about the "self-debugging abilities" of LLMs (e.g. GPT4)
• Why would an LLM be better at verifying than generating?
• (The usual distinction between effectiveness of generative vs. discriminative classifiers doesn't quite hold here)
• It is possible to fine tune an existing LLM to be a better verifier (..as we shall see)

82
Fine Tuning the Pre-trained model to be both
a generator and verifier
• Start with GPT-2
• [Finetuned generator:] Fine tune GPT-
2 as a generator on a corpus of blocks
world plans
• [Finetuned Verifier:] Use the same
corpus to train a verifier (based off of
GPT-2)
• Do Verifier-augmented generation
• Sort of similar to the back-prompting
with VAL (except that the verifier here is
also learned from the same corpus)

[Arora & Kambhampati, PRL, IJCAI 2023]

84
Making sense of the Open Source Orchestration
Frameworks: AutoGPT, LangChain etc.
• Think “Webservice Orchestration
Frameworks” which allow you to write your
own “agents”
• LLM as the core controller of external
components
• Which in turn is controlled by human prompting
• Safety issues include both safety of the outside
components and safety of the prompt-based
control of LLMs
• They are LLM-modulo-solvers all right, but the
actual orchestration is done with human help
(“language” programming)
• The “planning” part is basically pipelining the
right external services – and is done with human
help
• One core external service they all use is
“external memory” to write into and retrieve
• Because LLMs themselves have no memory
beyond their context window.

Weng, Lilian. (Jun 2023). LLM-powered Autonomous Agents". Lil’Log. https://lilianweng.github.io/posts/2023-06-23-agent/.


LLMs for Extracting Planning
Knowledge
Use LLMs to tease out domain models

Use LLMs to get "task reduction" schemas (recipes for individual tasks)
(That the humans in the loop correct with CoT prompting)

87
What Planning is & What LLMs are good at..

Planning (as used in common parlance) involves
• Planning knowledge
  • Actions, preconditions and effects
  • General Recipes: Task reduction schemata (e.g. HTN planning)
  • Old examples: Case libraries
• Plan generation/verification techniques
  • Interaction analysis/resolution
  • Plan merging techniques
  • Plan modification techniques

Contrasting what AI Planning & LLMs bring to the table
• AI Planning (aka ICAPS planning) assumes that the planning knowledge is given up front, and focuses on generation and verification techniques
  • Emphasis on guaranteeing completeness/correctness of the plans w.r.t. the model
  • By and large the common paradigm—although there have been occasional mutinies
    • Model-Lite Planning approaches
• LLMs, trained as they are on everything ever put on the web, have a kind of "approximate omniscience". This helps them spit out actions, recipes, or cases
  • But they lack the ability to stitch the recipes together to ensure that the result is actually interaction free!

LLMs accept any planning problem—even if it is not expressible in the PDDL standard—and they don’t give any correctness guarantees.
AI Planners will give formal guarantees, but only accept problems expressible in their language.

88
LLMs for constructing world models

LLMs have universal high recall (they will never shut up!), but questionable precision
Automated Planners are guaranteed correct, but only for planning problems that they can handle

90
LLMs for constructing world models
• We utilize LLMs to extract a symbolic
representation of the actions in the
form of PDDL action models
• This intermediate output can be used
with an external domain-independent
planner to reliably search for feasible
plans, or it can be used to validate
and correct "heuristic" plans
generated by an LLM planner.
• We also show that LLMs can serve as an interface between PDDL and any feedback sources that can provide corrective feedback in natural language, such as humans and the PDDL validator VAL

91
Step 1. PDDL Construction

Incrementally construct the domain model: our algorithm generates PDDL models for each action separately, one at a time, by iterating over the set of actions.

Newly defined predicates are appended to the list of extracted predicates.

Every time a new predicate is defined, the LLM is required to give a natural language description of it.
Step 2. Correcting PDDL

Sources of corrective feedback:
● Syntax errors: PDDL validators like VAL
● Factual errors: human domain experts

Corrective feedback is integrated by replaying and continuing the PDDL-construction dialogue.

Example of corrective feedback in natural language:

There are some errors in the PDDL model:
1. There is an invalid object type `stove` for the parameter ?s
2. There is an unnecessary precondition “the food to heat is pickup-able”

Please revise the PDDL model (and the list of predicates if needed) to fix the above errors (and other potentially similar errors).
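
Putting the two steps together, the construct-then-correct loop can be sketched as below (my reading of the approach, not the authors' released code): build the PDDL model one action at a time while growing a shared predicate list, and replay the dialogue with any corrective feedback. `call_llm`, `make_action_prompt`, `parse_pddl_and_predicates`, and `get_feedback` are hypothetical helpers.

def construct_domain_model(action_names: list[str], domain_description: str, max_corrections: int = 3):
    predicates: list[str] = []          # running list of extracted predicates (with NL descriptions)
    action_models: dict[str, str] = {}  # action name -> PDDL fragment

    for name in action_names:           # Step 1: one action at a time
        dialogue = make_action_prompt(domain_description, name, predicates)
        for _ in range(max_corrections + 1):
            response = call_llm(dialogue)
            pddl, new_preds = parse_pddl_and_predicates(response)
            predicates.extend(p for p in new_preds if p not in predicates)
            feedback = get_feedback(pddl)   # Step 2: VAL for syntax errors, humans for factual errors
            if feedback is None:
                break
            # Replay and continue the dialogue with the corrective feedback in natural language
            dialogue += response + "\nThere are some errors in the PDDL model:\n" + feedback
        action_models[name] = pddl
    return action_models, predicates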
LLMs for constructing world models
• We tested on three domains
  1. Household domain
  2. Logistics
  3. Tyreworld

[Figures: example actions from the Household domain and the Logistics domain]

94
Code as (Hierarchical) Policies

The code parser extracts a task reduction schema from the LLM!

[Flowchart: Original LLM Prompt → LLM Response with Unrecognized Function → Code Parser (prompts again for any missing functions found) → Re-prompt → LLM Response → the new response is parsed and any missing functions are recursively prompted for if needed]

Liang, Jacky, et al. "Code as policies: Language model programs for embodied control." arXiv preprint arXiv:2209.07753 (2022). Prompts are from that paper.

96
LLMs as Human Preference Proxies
Can LLMs capture human preferences?

• We investigate the potential of LLMs to serve as


effective human proxies by capturing human
preferences in human-AI collaboration settings.
• LLMs can play different roles in Human-aware AI
interaction: as a Human Proxy, Translator (common
lingua franca), and the Actor.
• Theory of Mind (ToM) requires LLMs to also be
able to capture human mental states, desires, and
beliefs for reward design/learning mechanisms.
• Human-aware AI agents can incorporate such
reward functions to account for human-in-the-
loop’s preferences.
Figure: Different roles of an LLM in Human-AI interaction.

Preference Proxies: Evaluating Large Language Models in capturing Human Preferences in Human-AI Tasks
Mudit Verma*, Siddhant Bhambri*, Subbarao Kambhampati.
Workshop on Theory of Mind in Communicating Agents at ICML 2023 (Oral)
LLMs as Human Preference Proxies
Can LLMs capture human preferences?

Probing LLMs with explicability preferences:


• Under explicability preference, the human expects
the agent to behave in a certain way, and the
agent proactively attempts to model this
expectation and follow it.
• Here, the human takes the role of an observer.

Probing LLMs with sub-task specification preferences:
• We consider a Human-AI teaming scenario where the human plays an active role and can perform actions in the world alongside the AI agent.
• Sub-task specification preferences require the agent to produce the same set of sub-tasks that the human has in mind to achieve the team objective.

Figure: Different roles of an LLM in Human-AI interaction.
LLMs as Human Preference Proxies
Can LLMs capture human preferences?

Probing LLMs for Explicability Preferences:
[Table 1: Experiments on testing Theory-of-Mind capabilities of LLMs across 3 domains: Rover, Fetch and USAR. Y: matches with ground truth, Y*: matches with ground truth with correct reasoning, N: does not match with ground truth, -: no response.]

Probing LLMs for Sub-task Specification Preferences:
[Table 2: Experiments on testing Theory-of-Mind capabilities of LLMs across 3 Overcooked domain layouts: Asymmetric Advantages, Forced Coordination, and Counter Circuit.]
[Table 3: Experiments on testing Theory-of-Mind capabilities of LLMs against the general Overcooked game ground truth as per the domain's provided description, and the user study.]

TLDR; LLMs showed promise to be used as a human proxy. While the earlier LLM models struggled, newer models can perform much better, and human users agree with our findings!

104

LLMs for Format Change


Extracting plans in formal representation from text

Converting text specs of planning problems to formal specs (PDDL/LTL/STL etc)


The actual planning is basically given away to external stand-alone planners
LLMs in the context of decision support
• We have used them in the past
to translate existing documents
in natural language to formal
representations

GPT3-to-Plan: Extracting Plans from Text


[KEPS-21]

105
Text to plan using GPT-3
q Workshop on KEPS (ICAPS’21)
q Workshop on Planning for Financial
Services (ICAPS’21)

• We investigated how GPT-3, one of the most recent


transformer-based language models, can be used to extract
structured actions from natural language texts. We find that
these models achieve comparable, and in some cases better
scores than previous state of the art task-specific methods
• Impact: Existing knowledge in the form of textual procedures
and plans can be translated into formal representations to aid
novice Navy personnel understand and carry out complex
procedures. The translated procedures can also be leveraged
by other automated systems in-place.
106
Using LLMs to translate Goals specified in
Natural Language
• Perhaps the least ambitious way to use LLMs in plan generation is to
just have them convert the goals specified in natural language to
formal representations (..and then use an actual external planner to
solve the planning problem..)
• A bit of a Rube Goldberg approach..
• Examples (not by us) include
• Converting goal specifications to PDDL spec (LLM+P)
• Which basically involves putting in parentheses at the lowest end..
• Converting goal specification to STL spec (AutoTamp)

107
LLMs for Translating Natural
Language Goals to PDDL
• Using LLMs to translate Goal specifications from Natural Language
to PDDL. (LLM as an Interface between Humans and Planners).
• Translation subtasks:
• PDDL Domain Understanding
• Goal Inference
• PDDL Goal Specification.
• LLM does well for unambiguous fully specified goals but has mixed
performance for ambiguous partially specified goals.
• 5 skills tested for: 1) Linguistic Competence, 2) Object Association, 3) Numerical Reasoning, 4) Physical Reasoning, 5) World Knowledge
• Results show that LLM has reasonably good Linguistic
Competence, Object Association abilities, and World Knowledge.
However, it has poor Numerical & Physical Reasoning abilities.

Xie, Y., Yu, C., Zhu, T., Bai, J., Gong, Z., & Soh, H. (2023). Translating natural language to planning goals with large-language models. arXiv preprint arXiv:2302.05128. 108
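As a concrete illustration of the final translation subtask above (not an excerpt from the paper), the sketch below shows the kind of mapping being asked of the LLM: a natural-language goal over a Blocksworld-style domain and the PDDL goal expression one would want back. The predicate names and prompt shape are assumptions made for the example.

# Illustrative natural-language-goal to PDDL-goal translation example.
nl_goal = "Stack block A on block B, and make sure block C ends up clear."

# The goal expression an ideal translation would produce (standard Blocksworld predicates):
expected_pddl_goal = "(:goal (and (on a b) (clear c)))"

def translation_prompt(domain_predicates: str, goal_text: str) -> str:
    # Minimal prompt shape: list the domain's predicates, then ask for a goal
    # expression only (no plan), keeping the LLM in the translator role and
    # leaving the planning itself to an external planner.
    return (
        f"Domain predicates: {domain_predicates}\n"
        f"Translate the following goal into a PDDL :goal expression only:\n{goal_text}"
    )

print(translation_prompt("(on ?x ?y), (clear ?x), (ontable ?x), (holding ?x)", nl_goal))
print("Expected output:", expected_pddl_goal)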
LLM+P: Empowering LLMs with Optimal
Planning Proficiency
• LLM translates natural language problems into Problem PDDL by in-context learning (context is crucial); a minimal pipeline sketch follows below.
• Planner: Problem PDDL + Domain PDDL --> PDDL Plan
• LLM: PDDL Plan --> Natural Language Plan.
• LLM as a planner:
• Lacks the ability to reason about preconditions.
• Performs poorly in domains that require an understanding of complex spatial relationships.

Liu, B., Jiang, Y., Zhang, X., Liu, Q., Zhang, S., Biswas, J., & Stone, P. (2023). LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477. 109
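A minimal sketch of the pipeline described above, under stated assumptions: query_llm stands in for the LLM call, run_planner for whatever external classical planner is wired in, and the in-context example is elided. This is an illustration of the idea, not the authors' code.

# Illustrative LLM+P-style pipeline: LLM as translator, external planner as solver.
def query_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; canned output so the sketch runs as-is.
    return "(define (problem p1) (:domain blocksworld) ...)"

def run_planner(domain_pddl: str, problem_pddl: str) -> str:
    # Stand-in for invoking an external classical planner on the PDDL inputs.
    return "(unstack c a) (put-down c) (pick-up a) (stack a b)"

def llm_plus_p(nl_problem: str, domain_pddl: str, example: str) -> str:
    # 1. LLM translates the natural-language problem into Problem PDDL,
    #    guided by an in-context (NL problem, Problem PDDL) example.
    problem_pddl = query_llm(
        f"Example:\n{example}\n\nNow translate:\n{nl_problem}\nProblem PDDL:"
    )
    # 2. The external planner does the actual planning: Domain PDDL + Problem PDDL -> PDDL plan.
    pddl_plan = run_planner(domain_pddl, problem_pddl)
    # 3. LLM translates the PDDL plan back into a natural-language plan for the user.
    return query_llm(f"Rewrite this plan in plain English:\n{pddl_plan}")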
AutoTAMP
• LLMs are being used as translators and
verifiers
• They translate from natural language to
Signal Temporal Logic representation.
• An STL planner is used to come up with
plans.
• A re-prompting technique is used on the LLM translator and the verifier to improve performance (a minimal sketch of this loop follows below).

Chen, Y., Arkin, J., Zhang, Y., Roy, N., & Fan, C. (2023). AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers. arXiv preprint arXiv:2306.06531.
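A minimal sketch of such a translate-check-re-prompt loop, under assumptions: query_llm stands in for the LLM translator and check_stl for a syntax/semantic checker over the produced STL string; neither is AutoTAMP's actual interface, and only the general shape of the technique is shown.

# Illustrative re-prompting loop for natural language -> STL translation.
def query_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; canned STL-like output so the sketch runs as-is.
    return "G[0,10] (distance(robot, obstacle) > 0.5)"

def check_stl(formula: str):
    # Stand-in for a checker; returns a list of error messages (empty means it parses).
    return [] if formula.startswith(("G[", "F[")) else ["could not parse formula"]

def translate_with_reprompting(instruction: str, max_rounds: int = 3):
    prompt = f"Translate into an STL formula:\n{instruction}\nSTL:"
    for _ in range(max_rounds):
        formula = query_llm(prompt)
        errors = check_stl(formula)
        if not errors:
            return formula  # hand off to the external STL planner from here
        # Feed the checker's errors back to the LLM and ask for a corrected formula.
        prompt += f"\nPrevious attempt: {formula}\nErrors: {errors}\nTry again. STL:"
    return None

print(translate_with_reprompting("Stay at least 0.5 m away from the obstacle during the first 10 seconds."))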
Epilogue

112
113
A Summary of LLM roles in Planning we saw
• LLMs to generate plans autonomously
• (Different prompting techniques, fine-tuning techniques)
• Not that impressive if the plans require interaction resolution
• LLMs to generate plans with the help of external planners/verifiers (LLM-Modulo
setting)
• As heuristic seeds to external planners
• As targets of “back prompts” by external verifiers
• LLMs to extract plan knowledge
• Domain models
• Task reduction schemas
• LLMs as translators (e.g. from English to PDDL)
• Convert plans/recipes in English to formal representations (can be used for extracting
structured knowledge)
• Serve as a glorified natural language interface for specifying goals, problems etc..

114
What Planning is & What LLMs are good at..
Planning (as used in common parlance) involves
• Planning knowledge
  • Actions, preconditions and effects
  • General Recipes: Task reduction schemata (e.g. HTN planning)
  • Old examples: Case libraries
• Plan generation/verification techniques
  • Interaction analysis/resolution
  • Plan merging techniques
  • Plan modification techniques

LLMs accept any planning problem—even if it is not expressible in the PDDL standard—and they don’t give any correctness guarantees. AI planners will give formal guarantees, but only accept problems expressible in their language.

Contrasting what AI Planning & LLMs bring to the table
• AI Planning (aka ICAPS planning) assumes that the planning knowledge is given up front, and focuses on generation and verification techniques
  • Emphasis on guaranteeing completeness/correctness of the plans w.r.t. the model
  • By and large the common paradigm—although there have been occasional mutinies
    • Model-Lite Planning approaches
• LLMs, trained as they are on everything ever put on the web, have a kind of “approximate omniscience”. This helps them spit out actions, recipes, or cases
  • But they lack the ability to stitch the recipes together to ensure that the result is actually interaction free!
115
Then how come LLMs are trumpeted as doing
planning?
• Most cases where LLMs are claimed to generate executable plans, on closer examination, turn out to be cases where LLMs are getting by with the “generate approximate recipes” step
  • Generate approximate recipes/cases (for common sense domains)
    • e.g. wedding plans
  • Convert tasks into (approximate) task reduction schemas
    • Perhaps written out as “programs” (e.g. Code as Policies..)
    • (SHOP2 schemas were already pseudo lisp code—if only written by humans)
  • LLM-HTN and LLM-CBR differ from HTN and CBR in that they generate the task-reduction schemas or the cases on demand
• And the interaction resolution/search part is
  • either pushed under the rug
    • Consider “high level” plans like “wedding plans” for which there are enough generic recipes available in the training set, and which are described at a sufficiently high level of abstraction; the execution issues are left to the imagination of the user
  • or has been pawned off to human prompters who are required to give “hints” to the LLM to come up with plan variants that are (more) correct
    • Note that here the human is essentially playing the role of an external verifier & critic
    • In cases where the humans are end users not well versed with all details of the domain, they can be faulty verifiers
116
Planning in the age of LLMs
For far too long, there has been a race to the bottom on the level of knowledge given to planners
• Planning started knowledge-based
  • Remember, Noah was an HTN planner, y’all!
  • ..and fell to the ground propositional level—because it seemed too unseemly to depend on humans for these knowledge-based models
  • And focused on doing interaction resolution from first principles
• RL was worse—propositional was too high-level a knowledge to ask from humans
  • They wanted to say they will learn it all
  • And not have humans give any knowledge about the domain. They just wanted “SIMULATORS”,
  • ..and it took forever to do anything—even with simulators
  • RL is way darned too inefficient, y’all

LLMs change that—rather drastically!
• LLMs make it easy to get knowledge without making it look like we are inconveniencing any specific human
  • We are just stealing everything humans told each other—is all.
  • ..as long as you relax the requirement of the knowledge actually being “correct”
  • ..then again, do you really believe that huge human-written models are correct?
• So the million dollar question is: How would you do planning if you have some doddering know-it-all ready to give you any kind of knowledge
  • “Actions and effects”
  • “Task reduction schemas”
  • “Cases”
• Time for LLM-HTN, LLM-CBR etc. paradigms
• Or even a resurrection of the model-lite planning dream..
Traditional Planning
Model-Lite Planning for the LLM Age..

[Figure: a spectrum of planning model classes (classical, temporal metric, non-deterministic, stochastic, POMDP) arranged along “Underlying System Dynamics” and “Ease of learning/acquiring the models”, ranging from associative/uninterpretable to causal/interpretable]

Best Student Paper Nominee
[AAMAS 2016] [AAMAS 2015] [AIJ 2017; ICAPS 2014; IJCAI 2009, 2007]

118


On the role of Large
Language Models in Planning
Slides @
Subbarao Kambhampati
Karthik Valmeekam, Lin Guan, Matthew Marquez

bit.ly/3NC6vqs

119
