
A Systematic Survey of Automatic Prompt Optimization Techniques

Kiran Ramnath, Kang Zhou, Sheng Guan, Soumya Smruti Mishra, Xuan Qi, Zhengyuan Shen,
Shuai Wang, Sangmin Woo, Sullam Jeoung, Yawei Wang, Haozhu Wang, Han Ding,
Yuzhe Lu, Zhichao Xu, Yun Zhou, Balasubramaniam Srinivasan, Qiaojing Yan, Yueyan Chen,
Haibo Ding, Panpan Xu, and Lin Lee Cheong
Amazon Web Services
{raxkiran, zhoukang, shguan, soumish, xuaqi, donshen, wshui, sangminw, sullamij,
yawenwan, haozhuw, handing, yuzhelu, xzhichao, yunzzhou, srbalasu, qiaojiny,
yyanc, hbding, xupanpan, lcheong}@amazon.com

arXiv:2502.16923v2 [cs.CL] 2 Apr 2025

Abstract

Since the advent of large language models (LLMs), prompt engineering has been a crucial step for eliciting desired responses for various Natural Language Processing (NLP) tasks. However, prompt engineering remains an impediment for end users due to rapid advances in models, tasks, and associated best practices. To mitigate this, Automatic Prompt Optimization (APO) techniques have recently emerged that use various automated techniques to help improve the performance of LLMs on various tasks. In this paper, we present a comprehensive survey summarizing the current progress and remaining challenges in this field. We provide a formal definition of APO, a 5-part unifying framework, and then proceed to rigorously categorize all relevant works based on their salient features therein. We hope to spur further research guided by our framework.
1 Introduction

Since McCann et al. (2018) cast multi-task NLP as Question Answering, using prompts as inputs has become the standard way to elicit desired responses from Large Language Models (LLMs). Furthermore, LLMs' few-shot learning (Brown et al., 2020), instruction-following (Ouyang et al., 2022), and zero-shot reasoning capabilities (Kojima et al., 2023) have led to a widespread proliferation of prompting tricks for various tasks and model variants. However, LLMs still exhibit unpredictable sensitivity to various factors (explanation of the task (Li et al., 2023b), ordering (Liu et al., 2024a), stylistic formatting (Sclar et al.), etc.), causing a performance gap between two prompts that are semantically similar and thereby adding impediments to adoption by end users. Against this backdrop, Black-Box Automatic Prompt Optimization (APO) techniques have emerged that improve task performance via automated prompt improvements. They possess various attractive features: (1) they do not require parameter access on the LLMs performing the task, (2) they systematically search through the prompt solution space, and (3) they retain human interpretability of prompt improvements. In this survey paper, we aim to highlight the advances in the field. Our core contribution is a 5-part APO taxonomy combined with a comprehensive fine-grained categorization of the various design choices therein (see Fig. 1, Tables 2, 3, 4 in Appendix). We hope our framework will be informational for new and seasoned researchers alike, enabling further research on open questions.

2 Automatic Prompt Optimization Formulation

We formalize the process of automatic prompt optimization (APO) as follows. Given a task model M_task and an initial prompt ρ ∈ V, the goal of an APO system M_APO is to obtain the best-performing prompt template ρ_opt under a metric f ∈ F and eval-set D_val:

    ρ_opt := arg max_{ρ ∈ V} E_{x ∼ D_val} [ f(M_task(ρ ⊕ x)) ]    (1)

This objective function is not tractable for discrete prompt optimization, as token-sequence search spaces are combinatorial. Instead, APO techniques follow the general anatomy described in Algorithm 1 to obtain approximate solutions.
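To make Eq. (1) concrete, the sketch below estimates the objective for one candidate prompt by averaging a metric over D_val. It is a minimal illustration, not code from any surveyed system; `task_model` and `metric` are hypothetical stand-ins for M_task and f.

```python
from typing import Callable, Iterable, Tuple

def score_prompt(
    prompt: str,
    eval_set: Iterable[Tuple[str, str]],   # D_val as (input x, reference y) pairs
    task_model: Callable[[str], str],      # stand-in for M_task
    metric: Callable[[str, str], float],   # stand-in for f
) -> float:
    """Estimate E_{x ~ D_val}[ f(M_task(prompt ⊕ x)) ] by averaging over D_val."""
    scores = [metric(task_model(prompt + "\n" + x), y) for x, y in eval_set]
    return sum(scores) / len(scores)

# APO then approximates rho_opt by maximizing over a finite candidate pool:
# best = max(candidates, key=lambda p: score_prompt(p, eval_set, task_model, metric))
```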
3 Initialize Seed Prompts

3.1 Manual Instructions

Several approaches use a seed of manually created instructions that offer interpretable and strong baselines as the basis for further improvement, inter alia, ProTeGi (Pryzant et al., 2023), GPS (Xu et al., 2022), and SPRIG (Zhang et al., 2024b). While obtaining quality examples can be costly, APE (Zhou et al., 2022)^1 showed that a few hundred samples are sufficient for further optimization.

^1 Note: APE stands for the Automatic Prompt Engineer method introduced by Zhou et al. (2022), not to be confused with APO, which broadly refers to the entire area of Automatic Prompt Optimization.
Prompt optimization anatomy §2
- Seed Prompts §3
  - Manual Instructions §3.1
  - Instruction-induction via LLMs §3.2
- Inference evaluation and feedback §4
  - Numeric score §4.1
    - Task accuracy §4.1.1
    - Reward model score §4.1.2
    - Entropy-based §4.1.3
    - Negative log-likelihood §4.1.4
  - LLM Feedback §4.2
    - Improving single candidate §4.2.1
    - Improving multiple candidates §4.2.2
  - Human Feedback §4.3
- Candidate prompt generation §5
  - Heuristic-based edits §5.1
    - Monte Carlo Sampling §5.1.1
    - Genetic Algorithm §5.1.2
    - Word / phrase edits §5.1.3
    - Vocabulary pruning §5.1.4
  - Editing with auxiliary trained NN §5.2
    - Reinforcement Learning §5.2.1
    - LLM Finetuning §5.2.2
    - Generative Adversarial Networks §5.2.3
  - Metaprompt design §5.3
  - Coverage-based §5.4
    - Single prompt expansion §5.4.1
    - Mixture of experts §5.4.2
    - Ensemble methods §5.4.3
  - Program Synthesis §5.5
- Filter and retain promising candidates §6
  - TopK Greedy Search §6.1
  - Upper confidence bound and variants §6.2
  - Region-based joint search §6.3
  - Meta-heuristic ensemble §6.4
- Iteration depth §7
  - Fixed steps §7.1
  - Variable steps §7.2

Figure 1: Taxonomy of Automatic Prompt Optimization

Algorithm 1 Prompt optimization framework

1: P_0 := {ρ_1, ρ_2, ..., ρ_k}                 ▷ §3. Seed prompts
2: D_val := {(x_i, y_i)}_{i=1}^{n}             ▷ Validation set
3: f_1, ..., f_m ∈ F                           ▷ §4. Inference evaluation
4: for t = 1, 2, ..., N do                     ▷ §7. Iteration depth
5:   G_t := M_APO(P, D_val, F)                 ▷ §5. Generate prompt candidates
6:   P_t := Select(G_t, D_val, F)              ▷ §6. Filter and retain candidates
7:   if f_convergence ≤ ϵ then                 ▷ §7. Optionally check for early convergence
8:     exit
9: return arg max_{ρ ∈ P_N} E_{x ∼ D_val} [ f(M_task(ρ ⊕ x)) ]
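The following is a minimal Python sketch of Algorithm 1, assuming the `score_prompt` helper sketched in §2 and a hypothetical `generate_candidates` function standing in for the M_APO expansion step of §5; actual systems differ substantially in each step.

```python
def optimize(seeds, eval_set, task_model, metric,
             generate_candidates, n_steps=10, keep_k=4, eps=1e-3):
    """Generic APO loop: generate (§5), evaluate (§4), filter (§6), iterate (§7)."""
    population = list(seeds)                   # P_0, the seed prompts (§3)
    best_score = float("-inf")
    for _ in range(n_steps):                   # §7: iteration depth
        candidates = population + generate_candidates(population, eval_set)  # §5
        scored = sorted(
            ((score_prompt(p, eval_set, task_model, metric), p) for p in candidates),
            reverse=True,
        )
        population = [p for _, p in scored[:keep_k]]   # §6: retain top-K
        if scored[0][0] - best_score <= eps:           # §7: optional early convergence
            break
        best_score = scored[0][0]
    return population[0]                       # approximate rho_opt
```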
3.2 Instruction Induction via LLMs

Honovich et al. (2023) were the first to propose inducing LLMs to infer human-readable prompts based on a few demonstrations E (see Appendix 14.1 for the prompt). APE (Zhou et al., 2022) and DAPO (Yang et al., 2024c) use the induced seed instructions for further optimization, while MOP (Wang et al., 2025) and GPO (Li et al., 2023c) use APE to induce cluster-specific prompts. Apart from demonstrations, SCULPT (Kumar et al., 2024) induced instructions from task READMEs, while UniPrompt (Juneja et al., 2024) used LLMs to fill in structured templates.

Figure 2: Representative APO system

4 Inference Evaluation and Feedback

The evaluation step helps identify promising prompt candidates in each iteration. Some methods also use LLM feedback on prompt-response pairs to help generate more prompt candidates.

4.1 Numeric Score Feedback

4.1.1 Accuracy

Using task-specific accuracy metrics is the most straightforward and widespread way of eliciting feedback, i.a., (Zhou et al., 2022, 2023; Zhang et al., 2024b; Khattab et al., 2022). Classification and MCQ-based QA tasks use exact accuracy, while code-related tasks measure execution accuracy. Text generation tasks (summarization, translation, creative writing) employ flexible metrics like BLEU-N, Rouge-N, Rouge-N-F1, or embedding-based measures such as BERTScore (Zhang* et al., 2020) (Honovich et al., 2023; Dong et al., 2024b).

4.1.2 Reward-model Scores

Given the limitations of rigid accuracy metrics, some approaches proposed using learned reward models to provide more nuanced evaluations of prompt-response pairs (Deng et al., 2022; Sun et al., 2024a; Kong et al., 2024). OIRL (Sun et al., 2024a) trained an XGBoost-based reward model that takes query-prompt embedding pairs as input and predicts whether the prompt will elicit correct answers from the language model, then uses it to select appropriate prompts for specific queries via a best-of-N strategy. DRPO (Amini et al., 2024) follows an LLM-based reward modeling approach using both predefined and dynamic reward criteria. It first optimizes in-context learning examples E, and using those it optimizes the specific task prompt.

4.1.3 Entropy-based Scores

Entropy-based scores evaluate the entire output distribution induced by candidates, as opposed to a single inference instance. They are gradient-free but require access to the entire output probability distribution, something not usually possible with black-box LLMs. CLAPS (Zhou et al., 2023) leverages the negative incremental cross-entropy of π(x_i ⊕ v), v ∈ V, versus π(x_i) to identify promising words v ∈ V to add to the prompt. The topK words are then used as candidate tokens from which to construct candidate prompts. GRIPS (Prasad et al., 2023) simply added an entropy term to the task-weighted accuracy, −Σ_y π_ρ(y) ln(π_ρ(y)) + (1/|T|) Σ 1(y = ŷ), to prioritize output diversity in potential prompt candidates.
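As an illustration of the GRIPS-style score described above, the sketch below combines plain accuracy with the entropy of the predicted-label distribution; the variable names and the unweighted accuracy term are simplifications, not the paper's exact formulation.

```python
import math
from collections import Counter

def grips_style_score(predictions, references):
    """Accuracy term plus entropy of the predicted-label distribution,
    -sum_y p(y) ln p(y), to favor prompts that keep output diversity."""
    accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
    counts = Counter(predictions)
    total = len(predictions)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return accuracy + entropy

print(grips_style_score(["A", "B", "A", "C"], ["A", "B", "B", "C"]))
```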
4.1.4 Negative Log-likelihood of Output

Some approaches like APE, GPS (Xu et al., 2022), and PACE (Dong et al., 2024b) consider the negative log-likelihood (NLL) of token sequences under the target LLM, i.e., −log(π_ρ(y)). This, however, requires the log-probabilities to be accessible during the decoding of each token, limiting its applicability. The NLL for a ground-truth one-hot token sequence is equivalent to the cross-entropy.

4.2 LLM Feedback

A popular paradigm to augment or fully replace numeric scores is to use textual feedback generated by an evaluator LLM (Wang et al., 2024a; Long et al., 2024; Sinha et al., 2024). It is versatile because it can evaluate both the response as well as the prompt input. It can directly aid the prompt rewriting process while remaining flexible to individual tasks, as it only needs natural language instructions for general-purpose LLMs as opposed to task-specific handcrafting of metrics. A potential downside is the inference cost incurred due to an additional LLM call.
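The snippet below sketches this textual-feedback pattern: a single extra LLM call critiques a prompt-response pair in natural language. The critique template is illustrative, and `llm` is an assumed black-box completion function, not a specific API.

```python
from typing import Callable

CRITIQUE_TEMPLATE = """You are reviewing a prompt and the model response it produced.
Task input: {x}
Prompt: {prompt}
Response: {response}
Reference answer: {y}
List the concrete flaws in the prompt that led to errors, one per line."""

def llm_feedback(prompt: str, x: str, response: str, y: str,
                 llm: Callable[[str], str]) -> str:
    """One extra LLM call returns free-text feedback used to rewrite the prompt."""
    return llm(CRITIQUE_TEMPLATE.format(x=x, prompt=prompt, response=response, y=y))
```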
Paper | Seed instructions | Iteration depth | Inference evaluation | Candidate generation | Search + filter strategy
ProTeGi (Pryzant et al., 2023) | Manually created | Fixed | LLM feedback + task accuracy | LLM rewriter | UCB for trees
APE (Zhou et al., 2022) | Instruction induction | Fixed | Task accuracy | N/A | UCB
CRISPO (He et al., 2025) | Manually created | Fixed | LLM feedback + task accuracy | LLM rewriter | TopK selection
MOP (Wang et al., 2025) | Instruction induction | Fixed | Task accuracy | Mixture of experts | Region-based joint search
DSPY (Khattab et al., 2024) | Manually created + instruction induction | Variable | LLM feedback + task accuracy | Program synthesis | TopK selection
OPRO (Yang et al., 2024a) | Manually created | Variable | LLM feedback + task accuracy | Metaprompt design | TopK selection
GATE (Joko et al., 2024) | Manually created | Variable | Human feedback | LLM rewriter | N/A

Table 1: Comparison of some APO techniques under our framework (Tables 2, 3, 4 show the full comparison)

All the LLM feedback approaches provide multiple pieces of feedback and broadly fall into two categories - improving a single prompt candidate versus improving multiple prompt candidates (discussed below, with examples in Appendix 14.3).

4.2.1 Improving Single Candidate

SCULPT (Kumar et al., 2024) introduces a systematic method for tuning long, unstructured prompts by employing a hierarchical tree structure and two-step feedback loops - preliminary assessment and error assessment - to evaluate and correct prompts before and after execution. The feedback updates the hierarchical prompt tree, which is then back-synthesized into a new prompt candidate. PACE (Dong et al., 2024b) applies an actor-critic editing framework to the prompt refinement process itself, allowing for more dynamic and adaptive adjustments. Overcoming the limitations of optimizing a single metric, CRISPO (He et al., 2025) adopts a multi-aspect critique-suggestion metaprompt to highlight flaws in the generated response across multiple dimensions such as style, precision, and content alignment. Thereafter, it leverages detailed, aspect-specific feedback and iteratively updates the prompts. Autohint (Sun et al., 2023) summarizes feedback for multiple incorrect inferences via hints to instill improvements into a single prompt candidate.

4.2.2 Improving Multiple Candidates

ProTeGi (Pryzant et al., 2023) and TextGrad (Yuksekgonul et al., 2024) leverage textual "gradients" to guide the discrete prompt optimization procedure, very similar to the gradient-descent style of continuous prompt optimization approaches. Different from continuous gradient descent, ProTeGi samples multiple "gradients", i.e., directions of improvement, and each such "gradient" is used to generate several prompt candidates for evaluation in the next iteration. PromptAgent (Wang et al., 2024a) similarly used an error collection approach to emulate expert-written prompts that consisted of clear sections like "Task Description", "Domain Knowledge", "Solution Guidance", "Exception Handling", and "Output Formatting". PREFER (Zhang et al., 2024a) utilizes a feedback-reflect-refine cycle to aggregate feedback into multiple prompts in an ensemble to improve the model's ability to generalize across various tasks. Survival of the Safest (SOS) (Sinha et al., 2024) added a safety score to a multi-objective prompt optimization framework that uses an interleaved strategy to balance performance and security in LLMs simultaneously. To avoid accidentally damaging well-functioning prompts, StraGo (Wu et al., 2024) summarized strategic guidance based on both correct and incorrect predictions as feedback.

4.3 Human Feedback

A few works also incorporate human feedback, either at compile-time or at inference-time, into the prompt construction / optimization process. Joko et al. (2024) proposed "Generative Active Task Elicitation" to better capture human preferences. It prompts a language model to interactively ask questions and infer human preferences conditioned on the history of free-form interaction. Cheng et al. (2024) trained a smaller LLM to optimize input prompts based on user preference feedback, achieving up to a 22% increase in win rates for ChatGPT and 10% for GPT-4. PROMST (Chen et al., 2024) tackles the challenges of multi-step tasks by incorporating human-designed feedback rules and a learned heuristic model. APOHF (Lin et al., 2024) focuses on optimizing prompts using only human preference feedback rather than numeric scores, employing a dueling-bandits-inspired strategy to efficiently select prompt pairs for preference feedback, proving effective for tasks like text-to-image generation and response optimization.
5 Candidate Prompt Generation

In this step, one or more candidate prompts are generated that are most likely to result in an improvement in a metric of interest f ∈ F. The approaches reviewed below range from simple rule-based edits (sec. 5.1) to sophisticated agentic systems that combine LLM-based evaluations (sec. 4.2) with various filtering strategies (sec. 6).

5.1 Heuristic-based Edits

Several works proposed heuristic-based mechanisms that edit intermediate prompt candidates to generate newer candidates. They range from edits at the word / phrase / sentence level (either simple rule-based or LLM-generated) to metric-driven incremental search. While these strategies may not find the optimal solution, they help make the discrete prompt optimization problem computationally tractable.

5.1.1 Monte Carlo Sampling

ProTeGi (Pryzant et al., 2023) uses Monte Carlo sampling to explore combinatorial discrete solution spaces in an incremental fashion - it samples multiple textual gradients to generate prospective candidates, and spawns paraphrases as Monte Carlo successors for evaluation. PromptAgent (Wang et al., 2024a) uses a tree variant called Monte Carlo Tree Search (MCTS), which consists of four steps - Selection, Expansion, Simulation, and Backpropagation (also explained in Sec. 6).
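A rough sketch of the ProTeGi-style expansion described above: sample textual "gradients" from failure cases, rewrite with each, and spawn paraphrases as Monte Carlo successors. The prompt wordings and the `llm` function are assumptions for illustration, not ProTeGi's actual templates.

```python
import random

def monte_carlo_successors(prompt, failures, llm, n_gradients=3, n_paraphrases=2):
    """Sample textual 'gradients' (directions of improvement), apply each as a
    rewrite, then paraphrase the rewrites as Monte Carlo successors."""
    sampled = random.sample(failures, min(len(failures), 4))
    candidates = []
    for _ in range(n_gradients):
        gradient = llm("Why does this prompt fail on these examples?\n"
                       f"Prompt: {prompt}\nFailures: {sampled}")
        rewrite = llm(f"Rewrite the prompt to fix: {gradient}\nPrompt: {prompt}")
        candidates.append(rewrite)
        for _ in range(n_paraphrases):
            candidates.append(llm(f"Paraphrase, keeping the meaning: {rewrite}"))
    return candidates
```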
5.1.2 Genetic Algorithm

A significant line of work applies well-studied genetic algorithms to make discrete edits to texts. The common recipe is to 1/ mutate and 2/ cross over components from promising candidates.

Token mutations: SPRIG (Zhang et al., 2024b) and CLAPS perform token-level mutations. SPRIG uses a starting corpus of 300 components grouped into categories like CoT, roles, styles, emotions, scenarios, and good properties. It performs add/rephrase/swap/delete operations, highlighting the complementary strengths of optimizing system prompts alongside task prompts (via methods like ProTeGi) to enhance accuracy across multiple diverse domains, languages, and tasks without needing repeated task-specific optimizations.

LLM-based mutation: LMEA (Liu et al., 2023), SOS (Sinha et al., 2024), and StraGo (Wu et al., 2024) use mutation prompts with LLMs to overcome the traditional complexity of designing tailored operators for cross-over / mutation. PromptBreeder (Fernando et al., 2023) advocates self-referential improvement of all prompts in the prompt optimization system: Direct Mutation of task prompts, Hypermutation of the mutation prompts themselves, Lamarckian Mutation where prompts are reverse-engineered from successful examples (similar to Instruction Induction, Honovich et al. (2023)), and finally Crossover and Shuffling to improve the diversity of the prompt pool. EvoPrompt (Guo et al., 2024) uses Differential Evolution, where differences between existing prompts are incorporated to form new prompt candidates to overcome the problem of local optima. AELP (Hsieh et al., 2024) also uses mutation operators to perform sentence-level edits in an iterative fashion. It includes sentence-level histories of reward {(s_{t−1}, s_t, r_t)} in the mutation prompt in order to avoid local optima and accidentally returning to sub-optimal versions. GPS (Xu et al., 2022) used Back-translation, Sentence Continuation, and Cloze transformations to perform prompt mutation. PromptWizard (Agarwal et al., 2024) proposed a pipeline combining several steps, including iterative improvement, few-shot example synthesis and selection, utilizing the LLM's reasoning capability to improve and validate the prompt, and finally an expert persona to ensure consistency of the style of generated prompts.
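The mutate / cross-over recipe above can be sketched as follows, with an LLM serving as both operators; the operator prompts are illustrative, and `fitness` could be the `score_prompt` helper from §2.

```python
import random

def evolve(population, fitness, llm, generations=5, pop_size=8):
    """Genetic-algorithm recipe: LLM-driven mutation and crossover, then
    truncation selection by fitness. Assumes at least two prompts in the pool."""
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            if random.random() < 0.5:                  # 1/ mutate
                parent = random.choice(population)
                children.append(llm(f"Mutate this instruction, keep the task: {parent}"))
            else:                                      # 2/ cross-over
                a, b = random.sample(population, 2)
                children.append(llm(f"Combine the strengths of:\n1) {a}\n2) {b}"))
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
    return population[0]
```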
5.1.3 Word / Phrase Level Edits

Several word-edit approaches first identify "influential" tokens in the prompts. COPLE (Zhan et al., 2024) argued that LLMs exhibit lexical sensitivity, showing that merely replacing a few words with their synonyms can yield significant improvements. First, "influential" tokens are identified as those for which the expected loss on the dev-set, E_{D_val}[L(y, ŷ)], drops the most after removing that token versus the original prompt; then influential tokens are replaced using predictions from a masked language model. This token-replacement approach is also attractive as a standalone post-processing step for long prompts that have already been optimized using other LLM-based approaches. GRIPS (Prasad et al., 2023) argues that phrase-level editing is an effective and interpretable method to optimize prompts, leveraging four basic edit operations - add, delete, paraphrase, and swap.
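A sketch of the influence test attributed to COPLE: score each token by how much the dev-set loss drops when it is deleted. Here `loss_on_dev` is a hypothetical callable, and the exact influence definition in the paper may differ.

```python
def influential_tokens(prompt_tokens, loss_on_dev, top_n=3):
    """A token is 'influential' if the dev-set loss drops the most when that
    token is removed from the prompt, relative to the original prompt."""
    base = loss_on_dev(prompt_tokens)
    drops = []
    for i in range(len(prompt_tokens)):
        ablated = prompt_tokens[:i] + prompt_tokens[i + 1:]
        drops.append((base - loss_on_dev(ablated), i))
    return [i for _, i in sorted(drops, reverse=True)[:top_n]]
```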
5.1.4 Vocabulary Pruning

Some works prune the vocabulary space V to V_pruned for decoding the next token of the optimized prompt ρ*. CLAPS (Zhou et al., 2023) argued that general search spaces are highly redundant, and uses K-means clustering to find word clusters and retain the top 2000 words closest to the cluster centroids. BDPL (Diao et al., 2022) used pairwise mutual information (PMI) to retain top co-occurring n-grams for decoding. PIN (Choi et al., 2024) instead added regularization in the form of Tsallis entropy (ideal for heavy-tailed distributions like natural language) to the RL training of a prompt generation network, to reduce the probability mass of unlikely tokens and improve interpretability.
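As an illustration of the PMI-based pruning attributed to BDPL, the sketch below ranks co-occurring bigrams by pointwise mutual information over a token corpus and keeps the top-k; this is a simplified stand-in assuming a pre-tokenized corpus, not BDPL's implementation.

```python
import math
from collections import Counter

def top_pmi_bigrams(corpus_tokens, top_k=1000):
    """Rank co-occurring bigrams by pointwise mutual information,
    PMI(a, b) = log p(a, b) / (p(a) p(b)), and keep the top-k."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    def pmi(pair):
        a, b = pair
        p_ab = bigrams[pair] / n_bi
        return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

    return sorted(bigrams, key=pmi, reverse=True)[:top_k]
```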
5.2 Editing via Auxiliary Trained NN

Some approaches leverage a trained auxiliary neural network to edit the initial prompt to obtain desired improvements. We include approaches where the finetuned network is different from, and smaller than, the task network.

5.2.1 Reinforcement Learning

Multi-objective optimization techniques (Jafari et al., 2024) demonstrate superiority over simple reward averaging, particularly through volume-based methods that effectively balance competing objectives. Dynamic prompt modification strategies, introduced through prompt rewriting (Kong et al., 2024), directional stimulus prompting (Li et al., 2023d), and test-time editing (Zhang et al., 2022), pursue the important goal of moving beyond static prompt generation. Prompt-OIRL (Sun et al., 2024a) also tackled the test-time optimization objective by learning an offline reward model and subsequently using a best-of-N strategy to recommend the optimal prompt in a query-dependent fashion. BDPL (Diao et al., 2022) optimized discrete prompts using a variance-reduced policy gradient algorithm to estimate gradients, allowing user devices to fine-tune tasks with limited API calls.

5.2.2 Finetuning LLMs

BPO (Cheng et al., 2024) trains a smaller 7B model to align itself to task performance on individual LLMs using reward-free alignment. FIPO (Lu et al., 2025) trains a local model (7B - 13B) to perform prompt optimizations to preserve privacy and to better adapt to target models, leveraging both data diversification and strategic fine-tuning such as SFT, preference optimization, and iterative preference learning.

5.2.3 Generative Adversarial Networks

Long et al. (2024) framed the prompt optimization process in the GAN setting. The LLM generator takes the question and the generation prompt to produce an output. The (input, output) pairs are evaluated by an LLM-powered discriminator, whose goal is to distinguish generated pairs from ground-truth pairs. Both the generator and the discriminator are jointly optimized using an adversarial loss, utilizing a prompt-modifier LLM to rewrite their prompts.
5.3 Metaprompt Design

PE2 (Ye et al., 2024) argued that previous works under-explored the meta-prompt search space. OPRO (Yang et al., 2024a) proposes a meta-prompt design (see Appendix 14.2) which includes the optimization problem description in natural language and previously generated solutions (multiple solutions per stage for diversity) with their scores, alongside the meta-instruction for prompt refinement. DAPO (Yang et al., 2024c) utilizes a well-designed meta-instruction to guide the LLM in generating high-quality and structured initial prompts (containing task-specific info, e.g., task type and description, output format and constraints, reasoning process, and professional tips) by observing given input-output exemplars. Then, DAPO iteratively optimizes the prompts at the sentence level, leveraging previous tuning experience to expand prompt candidates.
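A sketch of an OPRO-style meta-prompt in the spirit of the design above: problem description, prior solutions with scores, and a refinement instruction. The wording is illustrative; the actual template appears in Appendix 14.2 of this paper.

```python
def build_metaprompt(task_description, scored_prompts, n_show=5):
    """Show the best previous prompts with their scores (ascending) and ask
    the optimizer LLM for a prompt that scores higher."""
    history = sorted(scored_prompts, key=lambda sp: sp[0])[-n_show:]
    lines = [f"Task: {task_description}", "Previous prompts and their scores:"]
    lines += [f"score={s:.3f}: {p}" for s, p in history]
    lines.append("Write a new prompt that scores higher than all of the above.")
    return "\n".join(lines)
```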
5.4 Coverage-based

Some approaches seek to "cover" the entire problem space, either within a single prompt or using multiple prompts working individually or in an ensemble during inference.

5.4.1 Single Prompt-expansion

AMPO (Yang et al., 2024d) uses LLM feedback to enumerate all the failure cases based on the evaluation set D_val and then enlists each of them in the meta-instruction in an if-then-else format using three modules - 1/ Pattern Recognition, 2/ Branch Adjustment, and 3/ Branch Pruning - to decide whether to enhance existing branches or to grow new ones. Similarly, UniPrompt focused on explicitly ensuring that the various semantic facets of a task are represented in the final prompt. It designs a human-like (manual) prompt engineering approach with two stages: a) task facet initialization using background knowledge, and b) refinement using examples.
5.4.2 Mixture of Experts

Wang et al. (2025) introduced the Mixture-of-Expert-Prompts (MOP), where each expert is a task prompt to be used for specialized inference. MOP first clusters all demonstrations using K-means clustering. Then, the Region-based Joint Search (RBJS) algorithm (sec. 6.3) generates the appropriate instruction for each exemplar cluster via instruction induction (sec. 3.2), based on a mix of in-cluster and out-of-cluster demonstrations to cover "blind spots". During inference, the single expert prompt whose cluster centroid µ_c is closest to the instance embedding, arg min_c ||ϕ(x_i) − µ_c||², is invoked.
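MOP's inference-time routing reduces to a nearest-centroid lookup, sketched below; `embed` stands in for the embedding function ϕ and `centroids` for the K-means centroids µ_c, both assumptions for illustration.

```python
def route_to_expert(x, embed, centroids, expert_prompts):
    """Pick the expert prompt with argmin_c ||phi(x) - mu_c||^2."""
    phi = embed(x)                                   # instance embedding phi(x_i)
    dists = [sum((a - b) ** 2 for a, b in zip(phi, mu)) for mu in centroids]
    return expert_prompts[dists.index(min(dists))]
```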
5.4.3 Ensemble Methods

PromptBoosting (Hou et al., 2023), BoostedPrompting (Pitis et al., 2023), PREFER (Zhang et al., 2024a), etc. are ensemble methods that invoke multiple prompts during inference and combine them to generate the final output ŷ = y_0 + Σ_i^m β_i y_i. GPO (Li et al., 2023c) also uses labeled source data to generate an ensemble of prompts, which are applied to unlabeled target data to generate output through majority voting.
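For classification, the weighted combination above reduces to a weighted vote, sketched below with hypothetical weights β_i; unit weights recover GPO-style plain majority voting.

```python
from collections import defaultdict

def weighted_vote(member_outputs, betas):
    """Weighted majority vote over per-prompt outputs, a discrete analogue
    of y_hat = y_0 + sum_i beta_i * y_i."""
    tally = defaultdict(float)
    for y_i, beta_i in zip(member_outputs, betas):
        tally[y_i] += beta_i
    return max(tally, key=tally.get)

print(weighted_vote(["yes", "no", "yes"], [0.5, 0.9, 0.7]))  # -> "yes"
```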
5.5 Program Synthesis

Program-synthesis-based approaches transform LLM pipelines into structured, modular components that can be systematically optimized and composed. These optimization techniques iteratively refine instructions and demonstrations for each module to improve the entire pipeline's performance. DSP (Khattab et al., 2022) introduces a three-stage framework for retrieval-augmented inference: Demonstrate (generates task-specific demonstrations), Search (retrieves relevant information), and Predict (combines retrieved info with demonstrations). DSPY (Khattab et al., 2024) transforms LLM pipelines into text transformation graphs, introducing parameterized models, learning through demonstrations, and a compiler that optimizes pipelines. DLN (Sordoni et al., 2023) similarly considers chained LLM calls as stacked deep language networks performing variational inference, where the learnable parameters of each layer are task-decomposed prompt templates. MIPRO (Opsahl-Ong et al., 2024) automates the optimization of multi-stage language model programs by improving instructions and demonstrations for each module. SAMMO (Schnabel and Neville, 2024) proposed symbolic prompt programming, representing prompts as directed acyclic graphs (DAGs). A set of user-defined node mutation rules guides the mutation search to find the optimal DAG, which is then converted back to a prompt.
6 Filter and Retain Promising Prompts

In this step, promising prompt candidates are filtered for further optimization.

6.1 TopK Greedy Search

The simplest mechanism to iteratively search through prompt candidate sets is a greedy topK search, where in each iteration of the optimization the top-K best-performing candidates on a mini-batch of data instances from D_val are retained for further iterations (e.g., ProTeGi, AELP). This differs from beam search, which judges partial solutions based on the reward for the entire trajectory of prompt edits, r({ρ^1_1, ρ^1_2, . . . , ρ^1_t}).
ing through demonstrations, and a compiler that op- linear estimate based on the reward trajectories of
timizes pipelines. DLN (Sordoni et al., 2023) simi- previously sampled edits as well as prompt embed-
larly considers chained LLM calls as stacked deep ding ϕ(s) to select the next best-arm.
language networks performing variational infer-
ence, where the learnable parameters for each layer
are task-decomposed prompt templates. MIPRO
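A sketch of UCB-based arm selection over prompt candidates as described here: mean observed reward plus an exploration bonus, keeping the top-B arms. The bonus constant and bookkeeping are illustrative, not any one paper's settings.

```python
import math

def ucb_select(means, pulls, total_pulls, top_b=4, c=1.0):
    """Score each prompt-arm by mean reward plus an exploration bonus,
    then keep the top-B arms for further evaluation."""
    scores = [
        (m + c * math.sqrt(math.log(max(total_pulls, 2)) / max(n, 1)), i)
        for i, (m, n) in enumerate(zip(means, pulls))
    ]
    return [i for _, i in sorted(scores, reverse=True)[:top_b]]
```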
6.3 Region-based Joint Search

MOP (Wang et al., 2025) proposes a Mixture-of-Expert-Prompts, performing prompt optimization for each expert individually. Once C exemplar clusters are identified, the RBJS search first samples examples D_exemplars ⊆ D_C ∪ (D \ D_C), and then uses APE to induce and optimize each expert instruction.
6.4 Metaheuristic Ensemble

The PLUM library (Pan et al., 2024) offers a metaheuristic ensemble of different search algorithms such as Hill Climbing, Simulated Annealing, Genetic Algorithms, Tabu Search, and Harmony Search.
7 Iteration Depth

7.1 Fixed Steps

Most approaches carry out the prompt optimization for a fixed number of steps N.

7.2 Variable Number of Steps

GRIPS (Prasad et al., 2023) concludes the search when the number of successive iterations with negative gains breaches a patience parameter, whereas PromptAgent concludes APO when r_t ≤ ϵ_min ∨ r_t ≥ ϵ_max.
8 Theoretical Perspectives

8.1 Upper Bound of Improvement from APO

AlignPro (Trivedi et al., 2025) establishes an upper bound on the gains realizable from discrete prompt optimization under a given prompt optimizer, as well as a suboptimality gap w.r.t. the RLHF-optimal policy π*, while a lower bound is left unexplored.

8.2 Other Related Perspectives

Bhargava et al. (2024) proposed a control-theoretic framework to establish bounds on the set of reachable LLM outputs for self-attention in terms of the singular values of its weight matrices. Liu et al. (2024c) showed the existence of a strong transformer that can approximate any sequence-to-sequence Lipschitz function. They also showed the existence of "difficult" datasets that depth-limited transformers could not commit to memory.
9 Challenges and Future Directions

9.1 Task-agnostic APO

All the surveyed APO methods assume that the task type T is known beforehand; additionally, offline APO methods also require an evaluation set D_val, something not explicitly available in production settings. Barring a few tasks covered by Joko et al. (2024); Sun et al. (2024a); Zhang et al. (2022); Choi et al. (2024), inference-time optimization of multiple unknown tasks is underexplored. More robust evaluations are needed for task-agnostic APO systems combining seen and unseen tasks.

9.2 Unclear Mechanisms

Melamed et al. (2024) showed that prompts have so-called "evil twins" that are uninterpretable yet recover some of the performance of gold-standard prompts. Lu et al. (2024) showed that rare gibberish strings can serve as competitive delimiters τ in prompts. Yang et al. (2024b) showed that self-reflection by LLMs can suffer from incorrect error identification, prior biases, and semantic invalidity, leading to failure in yielding improved prompts. More studies are needed to better uncover the mechanisms of prompt optimization.

9.3 APO for System Prompts / Agents

Although SPRIG explored optimizing system prompts in chat-style settings, scalability remains a challenge: optimizing system prompts required a predefined corpus and close to 60 hours, whereas ProTeGi only needed ~10 minutes per task. Similarly, optimizing prompts for several components of an agentic system concurrently poses an exciting direction for future research.

9.4 Multimodal APO

Recently, textual prompt optimization has expanded to multimodal domains: text-to-image (Liu et al., 2024b; Mañas et al., 2024; Liu et al., 2024d), text-to-video (Ji et al., 2024), text-to-audio (Huang et al., 2023), and text-image alignment models like CLIP (Du et al., 2024; Mirza et al., 2024). Beyond textual prompts, Huang et al. (2023) explore optimizing multimodal inputs, such as images, to elicit better responses from large multimodal models. However, the interplay between modalities in prompt optimization remains underexplored. Future research could develop APO frameworks to jointly optimize multimodal prompts (e.g., remove background noise from audio, add visual markers to videos, etc.) to fully leverage their synergies.

10 Conclusion

In this paper, we provided a comprehensive, fine-grained review of existing APO techniques and identified key areas for future growth. We aim to spur future research spawning from our survey.
11 Limitations

While we attempted to cover all qualifying papers, it is possible that we may have unintentionally missed some relevant papers. We also mention some of the papers that were excluded from this survey, with specific reasons, in section 12.2. Also, we realize that fitting varied research works into a single unifying framework risks broad categorization for some papers, or skipping some characteristics of others (e.g., Tempera (Zhang et al., 2022) consists of both RL-based and word/phrase-level editing techniques, applied to both instructions and exemplars). In such cases, we categorize a paper based on its most salient features. Another challenge is that when presenting a survey paper under 8 pages, we had to make tradeoffs and only retain the content in the main body that was deemed most necessary. This resulted in having to relegate a core contribution (Tables 2, 3, 4), which contains a rigorous comparison of all the surveyed papers, to the appendix. We have attempted our best to strike the right balance between specificity and brevity to present a novel framework. We also provide copious references to interested researchers for further reading.
References

Eshaan Agarwal, Joykirat Singh, Vivek Dani, Raghav Magazine, Tanuja Ganu, and Akshay Nambi. 2024. Promptwizard: Task-aware prompt optimization framework.

Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020. Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679.

Afra Amini, Tim Vieira, and Ryan Cotterell. 2024. Direct preference optimization with an offset. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9954–9972, Bangkok, Thailand. Association for Computational Linguistics.

R. Anantha, Svitlana Vakulenko, Zhucheng Tu, S. Longpre, Stephen G. Pulman, and Srinivas Chappidi. 2020. Open-domain question answering goes conversational via question rewriting. In North American Chapter of the Association for Computational Linguistics.

Jacob Andreas, Johannes Bufe, David Burkett, Charles C. Chen, Joshua Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner, Jason Eisner, Hao Fang, Alan Guo, David Leo Wright Hall, Kristin Delia Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo Lanman, Percy Liang, C. H. Lin, Ilya Lintsbakh, Andy McGovern, Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth, Subhro Roy, Jesse Rusak, Beth Ann Short, Div Slomin, B Snyder, Stephon Striplin, Yu Su, Zachary Tellman, Sam Thomson, A. A. Vorobev, Izabela Witoszko, Jason Wolfe, A. G. Wray, Yuchen Zhang, and Alexander Zotov. 2020. Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics, 8:556–571.

Trapit Bansal, Rishikesh Jha, and Andrew McCallum. 2019. Learning to few-shot learn across diverse natural language classification tasks. In International Conference on Computational Linguistics.

Aman Bhargava, Cameron Witkowski, Shi-Zhuo Looi, and Matt Thomson. 2024. What's the magic word? a control theory of llm prompting.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2019. Piqa: Reasoning about physical commonsense in natural language. In AAAI Conference on Artificial Intelligence.

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. Multiwoz - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.

Daniel Matthew Cer, Mona T. Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In International Workshop on Semantic Evaluation.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Niehues Jan, Stüker Sebastian, Sudoh Katsuitho, Yoshino Koichiro, and Federmann Christian. 2017. Overview of the iwslt 2017 evaluation campaign. In International Workshop on Spoken Language Translation.

Yongchao Chen, Jacob Arkin, Yilun Hao, Yang Zhang, Nicholas Roy, and Chuchu Fan. 2024. PRompt optimization in multi-step tasks (PROMST): Integrating human feedback and heuristic-based sampling. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3859–3920, Miami, Florida, USA. Association for Computational Linguistics.
Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, and Minlie Huang. 2024. Black-box prompt optimization: Aligning large language models without model training. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3201–3219, Bangkok, Thailand. Association for Computational Linguistics.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6.

Minje Choi, Jiaxin Pei, Sagar Kumar, Chang Shu, and David Jurgens. 2023. Do llms understand social knowledge? evaluating the sociability of large language models with socket benchmark. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11370–11403.

Yunseon Choi, Sangmin Bae, Seonghyun Ban, Minchan Jeong, Chuheng Zhang, Lei Song, Li Zhao, Jiang Bian, and Kee-Eung Kim. 2024. Hard prompts made interpretable: Sparse entropy regularization for prompt tuning with rl.

Christopher Cieri, Mark Liberman, Sunghye Cho, Stephanie Strassel, James Fiumara, and Jonathan Wright. 2022. Reflections on 30 years of language resource development and sharing. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 543–550, Marseille, France. European Language Resources Association.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. 2023. Free dolly: Introducing the world's first truly open instruction-tuned llm. Company Blog of Databricks.

Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang, and Ming Zhou. 2020. Mutual: A dataset for multi-turn dialogue reasoning. ArXiv, abs/2004.04494.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop.

Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitmentbank: Investigating projection in naturally occurring discourse.

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric P. Xing, and Zhiting Hu. 2022. Rlprompt: Optimizing discrete text prompts with reinforcement learning.

Franck Dernoncourt and Ji Young Lee. 2017. Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts. In International Joint Conference on Natural Language Processing.

Robert C. Detrano, András Jánosi, Walter Steinbrunn, Matthias Emil Pfisterer, Johann-Jakob Schmid, Sarbjit Sandhu, Kern Guppy, Stella Lee, and Victor Froelicher. 1989. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology, 64(5):304–10.

Shizhe Diao, Zhichao Huang, Ruijia Xu, Xuechun Li, Yong Lin, Xiao Zhou, and Tong Zhang. 2022. Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531.

Rezarta Islamaj Doğan, Robert Leaman, and Zhiyong Lu. 2014. Ncbi disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1–10.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In International Joint Conference on Natural Language Processing.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. 2024a. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, Miami, Florida, USA. Association for Computational Linguistics.

Yihong Dong, Kangcheng Luo, Xue Jiang, Zhi Jin, and Ge Li. 2024b. PACE: Improving prompt with actor-critic editing for large language model. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7304–7323, Bangkok, Thailand. Association for Computational Linguistics.

Yingjun Du, Wenfang Sun, and Cees GM Snoek. 2024. Ipo: Interpretable prompt optimization for vision-language models. arXiv preprint arXiv:2410.15397.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In North American Chapter of the Association for Computational Linguistics.
Stefan Daniel Dumitrescu, Petru Rebeja, Beáta Lőrincz, Mihaela Găman, Mihai Daniel Ilie, Andrei Pruteanu, Adriana Stan, Luciana Morogan, Traian Rebedea, and Sebastian Ruder. 2021. Liro: Benchmark and leaderboard for romanian language tasks. In NeurIPS Datasets and Benchmarks.

Ibrahim Abu Farha and Walid Magdy. 2020a. From arabic sentiment analysis to sarcasm detection: The arsarcasm dataset. In OSACT.

Ibrahim Abu Farha and Walid Magdy. 2020b. From arabic sentiment analysis to sarcasm detection: The arsarcasm dataset. In OSACT.

Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2023. Promptbreeder: Self-referential self-improvement via prompt evolution. ArXiv, abs/2309.16797.

Rory A. Fisher. 1936. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7:179–188.

Noa Garcia, Chentao Ye, Zihua Liu, Qingtao Hu, Mayu Otani, Chenhui Chu, Yuta Nakashima, and Teruko Mitamura. 2020. A dataset and baselines for visual question answering on art. In European Conference on Computer Vision, pages 92–108.

Miguel García-Ortegón, Gregor N. C. Simm, Austin Tripp, José Miguel Hernández-Lobato, Andreas Bender, and Sergio Bacallado. 2021. Dockstring: Easy molecular docking yields better benchmarks for ligand design. Journal of Chemical Information and Modeling, 62:3486–3502.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for nlg micro-planners. In Annual Meeting of the Association for Computational Linguistics.

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361.

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79.

Chulaka Gunasekara, Jonathan K. Kummerfeld, Lazaros Polymenakos, and Walter S. Lasecki. 2019. Dstc7 task 1: Noetic end-to-end response selection. Proceedings of the First Workshop on NLP for Conversational AI.

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. 2024. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In The Twelfth International Conference on Learning Representations.

Han He, Qianchu Liu, Lei Xu, Chaitanya Shivade, Yi Zhang, Sundararajan Srinivasan, and Katrin Kirchhoff. 2025. Crispo: Multi-aspect critique-suggestion-guided automatic prompt optimization for text generation.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Xiaodong Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. ArXiv, abs/2009.03300.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. 2022. Instruction induction: From few examples to natural language task descriptions. ArXiv, abs/2205.10782.

Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. 2023. Instruction induction: From few examples to natural language task descriptions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1935–1952, Toronto, Canada. Association for Computational Linguistics.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Conference on Empirical Methods in Natural Language Processing.

Bairu Hou, Joe O'Connor, Jacob Andreas, Shiyu Chang, and Yang Zhang. 2023. Promptboosting: black-box text classification with ten forward passes. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org.

Cho-Jui Hsieh, Si Si, Felix Yu, and Inderjit Dhillon. 2024. Automatic engineering of long prompts. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10672–10685, Bangkok, Thailand. Association for Computational Linguistics.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401.

Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. 2023. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In International Conference on Machine Learning, pages 13916–13932. PMLR.
Yasaman Jafari, Dheeraj Mekala, Rose Yu, and Taylor Berg-Kirkpatrick. 2024. Morl-prompt: An empirical analysis of multi-objective reinforcement learning for discrete prompt optimization.

Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, et al. 2024. Prompt-a-video: Prompt your video diffusion model via preference-aligned llm. arXiv preprint arXiv:2412.15156.

Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Kumar Singh, and Mohit Bansal. 2020. Hover: A dataset for many-hop fact extraction and claim verification. In Findings.

Can Jin, Hongwu Peng, Shiyu Zhao, Zhenting Wang, Wujiang Xu, Ligong Han, Jiahui Zhao, Kai Zhong, Sanguthevar Rajasekaran, and Dimitris N. Metaxas. 2024. Apeer: Automatic prompt engineering enhances large language model reranking.

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. ArXiv, abs/2009.13081.

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577.

Hideaki Joko, Shubham Chatterjee, Andrew Ramsay, Arjen P De Vries, Jeff Dalton, and Faegheh Hasibi. 2024. Doing personal laps: Llm-augmented dialogue construction for personalized multi-session conversational search. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 796–806.

Gurusha Juneja, Nagarajan Natarajan, Hua Li, Jian Jiao, and Amit Sharma. 2024. Task facet learning: A structured approach to prompt optimization. arXiv preprint arXiv:2406.10504.

David Jurgens, Srijan Kumar, Raine Hoover, Daniel A. McFarland, and Dan Jurafsky. 2018. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391–406.

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024.

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2024. Dspy: Compiling declarative language model calls into self-improving pipelines.

Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, D. Corney, Benno Stein, and Martin Potthast. 2019. Semeval-2019 task 4: Hyperpartisan news detection. In International Workshop on Semantic Evaluation.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large language models are zero-shot reasoners.

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.

Weize Kong, Spurthi Amba Hombaiah, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. 2024. Prewrite: Prompt rewriting with reinforcement learning.

Shanu Kumar, Akhila Yesantarao Venkata, Shubhanshu Khandelwal, Bishal Santra, Parag Agrawal, and Manish Gupta. 2024. Sculpt: Systematic tuning of long prompts.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. ArXiv, abs/1906.00300.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Hector J. Levesque, Ernest Davis, and L. Morgenstern. 2011. The winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

Bei Li, Rui Wang, Junliang Guo, Kaitao Song, Xu Tan, Hany Hassan, Arul Menezes, Tong Xiao, Jiang Bian, and JingBo Zhu. 2023a. Deliberate then generate: Enhanced prompting framework for text generation.
Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, and Xing Xie. 2023b. Large language models understand and can be enhanced by emotional stimuli. arXiv preprint arXiv:2307.11760.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 661–670, New York, NY, USA. Association for Computing Machinery.

Moxin Li, Wenjie Wang, Fuli Feng, Yixin Cao, Jizhi Zhang, and Tat-Seng Chua. 2023c. Robust prompt optimization for large language models against distribution shifts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1539–1554, Singapore. Association for Computational Linguistics.

Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. 2023d. Guiding large language models via directional stimulus prompting. arXiv preprint arXiv:2302.11520.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.

Xiaoqiang Lin, Zhongxiang Dai, Arun Verma, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. 2024. Prompt optimization with human feedback.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Annual Meeting of the Association for Computational Linguistics.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024a. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.

Shengcai Liu, Caishun Chen, Xinghua Qu, Ke Tang, and Yew Soon Ong. 2023. Large language models as evolutionary optimizers. 2024 IEEE Congress on Evolutionary Computation (CEC), pages 1–8.

Shihong Liu, Samuel Yu, Zhiqiu Lin, Deepak Pathak, and Deva Ramanan. 2024b. Language models as black-box optimizers for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12687–12697.

Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. 2024c. Automatic and universal prompt injection attacks against large language models.

Yilun Liu, Minggui He, Feiyu Yao, Yuhe Ji, Shimin Tao, Jingzhou Du, Duan Li, Jian Gao, Li Zhang, Hao Yang, et al. 2024d. What do you want? user-centric prompt generation for text-to-image synthesis via multi-turn guidance. arXiv preprint arXiv:2408.12910.

Xuan Do Long, Yiran Zhao, Hannah Brown, Yuxi Xie, James Xu Zhao, Nancy F. Chen, Kenji Kawaguchi, Michael Shieh, and Junxian He. 2024. Prompt optimization via adversarial in-context learning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7308–7327, Bangkok, Thailand. Association for Computational Linguistics.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL Conference.

Junru Lu, Siyu An, Min Zhang, Yulan He, Di Yin, and Xing Sun. 2025. FIPO: Free-form instruction-oriented prompt optimization with preference dataset and modular fine-tuning schema. In Proceedings of the 31st International Conference on Computational Linguistics, pages 11029–11047, Abu Dhabi, UAE. Association for Computational Linguistics.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Annual Meeting of the Association for Computational Linguistics.

Yao Lu, Jiayi Wang, Raphael Tang, Sebastian Riedel, and Pontus Stenetorp. 2024. Strings from the library of babel: Random sampling as a strong baseline for prompt optimisation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2221–2231, Mexico City, Mexico. Association for Computational Linguistics.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. ArXiv, abs/1808.09602.

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, and Michal Drozdzal. 2024. Improving text-to-image consistency via automatic prompt optimization. arXiv preprint arXiv:2403.17804.
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Rimon Melamed, Lucas H. McCabe, Tanay Wakhare, Yejin Kim, H. Howie Huang, and Enric Boix-Adsera. 2024. Prompts have evil twins.

M Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, et al. 2024. Glov: Guided large language models as implicit optimizers for vision language models. arXiv preprint arXiv:2410.06154.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. In Annual Meeting of the Association for Computational Linguistics.

Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. 2020. Ethos: an online hate speech detection dataset. arXiv preprint arXiv:2006.08328.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Conference on Computational Natural Language Learning.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. ArXiv, abs/1808.08745.

Ehsan Nezhadarya, Yang Liu, and Bingbing Liu. 2019. Boxnet: A deep learning method for 2d bounding box estimation from bird's-eye view point cloud. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 1557–1564. IEEE.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial nli: A new benchmark for natural language understanding. ArXiv, abs/1910.14599.

Jekaterina Novikova, Ondrej Dusek, and Verena Rieser. 2017. The e2e dataset: New challenges for end-to-end generation. ArXiv, abs/1706.09254.

Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. 2024. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, Miami, Florida, USA. Association for Computational Linguistics.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR.

Rui Pan, Shuo Xing, Shizhe Diao, Wenhe Sun, Xiang Liu, KaShun Shum, Jipeng Zhang, Renjie Pi, and Tong Zhang. 2024. Plum: Prompt learning using metaheuristics. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2177–2197, Bangkok, Thailand. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. ArXiv, cs.CL/0409058.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Annual Meeting of the Association for Computational Linguistics.

Arkil Patel, S. Bhattamishra, and Navin Goyal. 2021. Are nlp models really able to solve simple math word problems? In North American Chapter of the Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273.

Silviu Pitis, Michael R Zhang, Andrew Wang, and Jimmy Ba. 2023. Boosted prompt ensembles for large language models. arXiv preprint arXiv:2304.05970.

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. Xcopa: A multilingual dataset for causal commonsense reasoning. arXiv preprint arXiv:2005.00333.

Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. 2023. Grips: Gradient-free, edit-based instruction search for prompting large language models.

Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic prompt optimization with "gradient descent" and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957–7968, Singapore. Association for Computational Linguistics.

Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? ArXiv, abs/1804.06323.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Conference on Empirical Methods in Natural Language Processing.

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. Gpqa: A graduate-level google-proof q&a benchmark. ArXiv, abs/2311.12022.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Subhro Roy and Dan Roth. 2016. Solving general arithmetic word problems. ArXiv, abs/1608.01413.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473.

Tobias Schnabel and Jennifer Neville. 2024. Symbolic prompt program search: A structure-aware approach to efficient compile-time prompt optimization.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.

Jingyuan Selena She, Christopher Potts, Sam Bowman, and Atticus Geiger. 2023. Scone: Benchmarking negation reasoning in language models with fine-tuning and in-context learning. In Annual Meeting of the Association for Computational Linguistics.

Zeru Shi, Zhenting Wang, Yongye Su, Weidi Luo, Fan Yang, and Yongfeng Zhang. 2024. Robustness-aware automatic prompt optimization.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

Ankita Sinha, Wendi Cui, Kamalika Das, and Jiaxin Zhang. 2024. Survival of the safest: Towards secure prompt optimization through interleaved multi-objective evolution. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1016–1027, Miami, Florida, US. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Gizem Sogancioglu, Hakime Öztürk, and Arzucan Özgür. 2017. Biosses: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics, 33:i49–i58.

Alessandro Sordoni, Eric Yuan, Marc-Alexandre Côté, Matheus Pereira, Adam Trischler, Ziang Xiao, Arian Hosseini, Friederike Niedtner, and Nicolas Le Roux. 2023. Joint prompt optimization of stacked llms using variational inference. In Advances in Neural Information Processing Systems, volume 36, pages 58128–58151. Curran Associates, Inc.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Hao Sun, Alihan Hüyük, and Mihaela van der Schaar. 2024a. Query-dependent prompt evaluation and optimization with offline inverse RL. In The Twelfth International Conference on Learning Representations.

Hong Sun, Xue Li, Yinchuan Xu, Youkow Homma, Qi Cao, Min Wu, Jian Jiao, and Denis Charles. 2023. Autohint: Automatic prompt optimization with hint generation. arXiv preprint arXiv:2307.07415.

Jingwei Sun, Ziyue Xu, Hongxu Yin, Dong Yang, Daguang Xu, Yudong Liu, Zhixu Du, Yiran Chen, and Holger R. Roth. 2024b. Fedbpt: efficient federated black-box prompt tuning for large language models. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. 2023. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. ArXiv, abs/1811.00937.

Prashant Trivedi, Souradip Chakraborty, Avinash Reddy, Vaneet Aggarwal, Amrit Singh Bedi, and George K. Atia. 2025. Align-pro: A principled approach to prompt optimization for llm alignment.

Nirali Vaghani and Mansi Thummar. 2023. Flipkart product reviews with sentiment dataset.

Ellen M Voorhees and Dawn M Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207.

Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan O. Arik. 2024. Teach better or show smarter? On instructions and exemplars in automatic prompt optimization.

Ruochen Wang, Sohyun An, Minhao Cheng, Tianyi Zhou, Sung Ju Hwang, and Cho-Jui Hsieh. 2025. One prompt is not enough: automated construction of a mixture-of-expert prompts. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. 2022a. Scienceworld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298.

William Yang Wang. 2017. "liar, liar pants on fire": A new benchmark dataset for fake news detection. In Annual Meeting of the Association for Computational Linguistics.

Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. Xing, and Zhiting Hu. 2024a. Promptagent: Strategic planning with language models enables expert-level prompt optimization. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022b. Self-instruct: Aligning language models with self-generated instructions. In Annual Meeting of the Association for Computational Linguistics.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Annual Meeting of the Association for Computational Linguistics.

Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur, and Na Cheng. 2024b. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39:165–210.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. In North American Chapter of the Association for Computational Linguistics.

Yurong Wu, Yan Gao, Bin Benjamin Zhu, Zineng Zhou, Xiaodi Sun, Sheng Yang, Jian-Guang Lou, Zhiming Ding, and Linjun Yang. 2024. StraGo: Harnessing strategic guidance for prompt optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10043–10061, Miami, Florida, USA. Association for Computational Linguistics.

Jasper Xian, Saron Samuel, Faraz Khoubsirat, Ronak Pradeep, Md Arafat Sultan, Radu Florian, Salim Roukos, Avirup Sil, Christopher Potts, and Omar Khattab. 2024. Prompts as auto-optimized training hyperparameters: Training best-in-class ir models from scratch with 10 gold labels. arXiv preprint arXiv:2406.11706.

Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Wang Yanggang, Haiyu Li, and Zhilin Yang. 2022. Gps: Genetic prompt search for efficient few-shot learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8162–8171.

Wei Xu, Alan Ritter, William B. Dolan, Ralph Grishman, and Colin Cherry. 2012. Paraphrasing for style. In International Conference on Computational Linguistics.

Weijia Xu, Andrzej Banburski-Fahey, and Nebojsa Jojic. 2024. Reprompting: automated chain-of-thought prompt inference through gibbs sampling. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2024a. Large language models as optimizers.

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. 2024b. Large language models as optimizers. In The Twelfth International Conference on Learning Representations.

Muchen Yang, Moxin Li, Yongle Li, Zijun Chen, Chongming Gao, Junqi Zhang, Yangyang Li, and Fuli Feng. 2024c. Dual-phase accelerated prompt optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12163–12173, Miami, Florida, USA. Association for Computational Linguistics.

Sheng Yang, Yurong Wu, Yan Gao, Zineng Zhou, Bin Benjamin Zhu, Xiaodi Sun, Jian-Guang Lou, Zhiming Ding, Anbang Hu, Yuan Fang, et al. 2024d. Ampo: Automatic multi-branched prompt optimization. arXiv preprint arXiv:2410.08696.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing.

Qinyuan Ye, Maxamed Axmed, Reid Pryzant, and Fereshte Khani. 2024. Prompt engineering a prompt engineer.

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. Textgrad: Automatic "differentiation" via text.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI/IAAI, Vol. 2.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.

Pengwei Zhan, Zhen Xu, Qian Tan, Jie Song, and Ru Xie. 2024. Unveiling the lexical sensitivity of llms: Combinatorial optimization for prompt enhancement. In Conference on Empirical Methods in Natural Language Processing.

Chenrui Zhang, Lin Liu, Chuyuan Wang, Xiao Sun, Hongyu Wang, Jinpeng Wang, and Mingchen Cai. 2024a. Prefer: prompt ensemble learning via feedback-reflect-refine. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, AAAI'24/IAAI'24/EAAI'24. AAAI Press.

Lechen Zhang, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, and David Jurgens. 2024b. Sprig: Improving large language model performance by system prompt optimization. ArXiv, abs/2410.14826.

Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E. Gonzalez. 2022. Tempera: Test-time prompting via reinforcement learning.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Neural Information Processing Systems.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.

Han Zhou, Xingchen Wan, Ivan Vulić, and Anna Korhonen. 2023. Survival of the most influential prompts: Efficient black-box prompt search via clustering and pruning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13064–13077, Singapore. Association for Computational Linguistics.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers.
12 Appendix
12.1 Notation
We now define the notation of key terms and expressions used throughout the paper.

1. T = task type; I = task instruction; E = {(xi, yi)}i=1..e = few-shot demonstrations in the prompt; τ = template delimiters; z = CoT recipe for a task instance, with zi ∈ Ii.

2. Mtask = target task model; MAPO = APO system.

3. ρ = concat([s1, s2, . . . , sm]) = concat(I, τ, E): a prompt composed of m sentences, which comprise the instruction, template delimiters, and few-shot demonstrations.

4. D = {(xi, yi)}i=1..m: a collection of m input-output pairs. Dval is the validation set used to validate prompt performance; Dtrain is the training set used to fine-tune the language model (Reprompting).

5. {f1, f2, . . .} ∈ F: metric functions with which to evaluate task-prompt performance.

6. r : S × A → R: reward model score, where S is the state space and A is the action space.

7. |V| = size of the vocabulary.

8. ϕ : V∗ → Rd: embedding function that takes a sentence, generated as a finite sequence of tokens belonging to a vocabulary V, and produces a floating-point representation of dimension d.

9. ρ∗ = argmaxρ∈V∗ EDval [fi(ρ)]: the best-performing prompt based on the metric score on the validation set.

10. k = number of candidates for top-K search; B = beam width for beam search; N = number of iterations for search.

11. C = number of experts in a Mixture of Experts approach (MOP); µC = cluster centroid of cluster C (MOP).

12. LLMtarget = target model used for inference; LLMrewriter = rewriter model used to rewrite prompts; LLMevaluator = evaluator model that provides LLM feedback on prompts, responses, or both.

13. λ with subscripts denotes different latency types: λt = total training cost/latency, including all offline costs for data collection, preprocessing, and model fine-tuning; λi = per-example inference latency; λm = per-example MLM inference latency.
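In practice, the selection objective in item 9 reduces to scoring each candidate prompt on Dval and keeping the arg max. Below is a minimal Python sketch of this evaluation loop; call_llm (standing in for Mtask) and metric (standing in for f) are hypothetical callables, not part of any surveyed system.

from typing import Callable, List, Tuple

def score_prompt(prompt: str,
                 val_set: List[Tuple[str, str]],
                 call_llm: Callable[[str], str],
                 metric: Callable[[str, str], float]) -> float:
    # Monte-Carlo estimate of E_{(x,y)~Dval}[f(Mtask(rho ⊕ x))] for one prompt rho.
    scores = [metric(call_llm(prompt + "\n" + x), y) for x, y in val_set]
    return sum(scores) / len(scores)

def select_best(candidates: List[str], val_set, call_llm, metric) -> str:
    # rho* = argmax over the candidate pool of the mean validation metric.
    return max(candidates, key=lambda p: score_prompt(p, val_set, call_llm, metric))

Every APO system in Tables 2–4 instantiates some variant of this loop; the design choices lie in how candidates are generated and how the (often expensive) scoring calls are budgeted.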

12.2 Excluded works


FedBPT (Sun et al., 2024b) used federated learning to update soft prompts rather than discrete tokens. Deliberate-then-Generate (Li et al., 2023a) randomly sampled arbitrary noisy inferences and prompted the task LLM to deliberate on the wrong inference, while Reflexion (Shinn et al., 2023) agents maintain an episodic buffer of past deliberations; neither method optimizes the input prompt. AutoPrompt (Shin et al., 2020) required gradient access to the task LLM and is therefore not black-box.
12.3 UCB based selection algorithm
Algorithm 2 Select(·) with UCB Bandits
Require: n prompts ρ1, ..., ρn, dataset Dval, T time steps, metric function m
1: Initialize: Nt(ρi) ← 0 for all i = 1, . . . , n
2: Initialize: Qt(ρi) ← 0 for all i = 1, . . . , n
3: for t = 1, . . . , T do
4:   Sample uniformly Dsample ⊂ Dval
5:   ρi ← arg maxρ { Qt(ρ)/Nt(ρ) + c · sqrt(log t / Nt(ρ)) }
6:   Observe reward ri,t = m(ρi, Dsample)
7:   Nt(ρi) ← Nt(ρi) + |Dsample|
8:   Qt(ρi) ← Qt(ρi) + ri,t
9: return SelectTopB(QT / NT)
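A minimal runnable sketch of Algorithm 2 in Python follows. The function and argument names are illustrative; metric(prompt, batch) is assumed to return a scalar reward for evaluating the prompt on a mini-batch sampled from Dval.

import math
import random

def ucb_select(prompts, val_set, metric, T=100, c=2.0, sample_size=8, top_b=3):
    # UCB bandit selection over candidate prompts (sketch of Algorithm 2).
    n = len(prompts)
    N = [0.0] * n  # pull counts N_t(rho_i), incremented by |D_sample| as in line 7
    Q = [0.0] * n  # cumulative rewards Q_t(rho_i)
    for t in range(1, T + 1):
        batch = random.sample(val_set, min(sample_size, len(val_set)))

        def ucb(i):
            # Unpulled arms get infinite priority so every prompt is tried once.
            if N[i] == 0:
                return float("inf")
            return Q[i] / N[i] + c * math.sqrt(math.log(t) / N[i])

        i = max(range(n), key=ucb)
        r = metric(prompts[i], batch)  # observed reward r_{i,t}
        N[i] += len(batch)
        Q[i] += r
    means = [Q[i] / N[i] if N[i] else 0.0 for i in range(n)]
    ranked = sorted(range(n), key=lambda j: means[j], reverse=True)
    return [prompts[j] for j in ranked[:top_b]]

The exploration constant c trades off re-scoring promising prompts against sampling under-explored ones; ProTeGi and PromptAgent use this style of bandit search to keep the number of Dval evaluations tractable.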
13 Comparison of different approaches + Tasks
13.1 Comparison
Below we offer a comprehensive comparison of all the surveyed methods against our framework, covering the following aspects:

1. Seed instructions
2. Inference evaluation
3. Candidate generation
4. Search+filter strategy
5. Iteration depth
6. Optimization time complexity
7. Prompt generation model
8. Target models
SNo. Method Seed instruc- Inference evaluation Candidate generation Search+filter strategy Iteration depth Optimization time Prompt genera- Target models
tions complexity tion model
1 GPS (Xu et al., 2022) Manually Task accuracy Genetic Algorithm: Metaheuristic ensemble Fixed O(T ∗ N ∗ k ∗ λi ) T0
created Back translation,
Cloze,
Sentence continuation
2 GRIPS (Prasad et al., Manually Entropy-based score+ Phrase level TopK selection Fixed O(k ∗ N ∗ |Dval | ∗ PEGASUS para- InstructGPT
2023) created Task accuracy add/remove/swap/paraphrase B) phrase model
3 Instruction induction Instruction Accuracy + LLM-rewriter Fixed O(|ρ| ∗ λi ) InstructGPT, GPT-3 InstructGPT, GPT-3
(Honovich et al., 2023) induction BERTScore
4 RLPrompt (Deng et al., Manually Task accuracy + RL-based trained NN TopK selection Fixed O(N ∗ ρ ∗ |V | ∗ λi ) RoBERTa-large 1/ BERT, 2/ GPT-2
2022) created Reward model score Reward model-
DistilBERT
5 TEMPERA (Zhang et al., Manually Task accuracy RL-trained NN Fixed O(N ∗ k ∗ |V | ∗ C) RoBERTa-large RoBERTa-large
2022) created
6 AELP (Hsieh et al., 2024) Manually Task accuracy Genetic algorithm: Beam search Fixed O(N ∗ ρ ∗ k ∗ |D| ∗ PaLM 2-L PaLM text-bison
created LLM-mutator λi )
7 APE (Zhou et al., 2022) Instruction Task accuracy No new candidates TopK selection Fixed O(N ∗ k ∗ |Dval | ∗ InstructGPT, GPT- InstructGPT, GPT-3
induction λi ) 3, T5,
InsertGPT
8 AutoHint (Sun et al., Manually Task accuracy + LLM rewriter TopK selection Fixed O(T ∗ |D| ∗ λi ) GPT-4
2023) created LLM-feedback
9 BDPL (Diao et al., 2022) Manually Task accuracy RL-trained NN TopK selection Variable O(N ∗ k ∗ λi ) RoBERTa, GPT-3 RoBERTa, GPT-3
created
10 Boosted Prompting Instruction- Task accuracy Ensemble based TopK selection Variable O(N ∗ k ∗ λi ) text-curie-001, text- text-curie-001, text-curie-003,
(Pitis et al., 2023) induction method curie-003, GPT-3.5, GPT-3.5, code-davinci-002
code-davinci-002
11 BPO (Cheng et al., 2024) Manually LLMaaJ (pairwise) Finetuned LLMs NA NA O(λt + |Dval | ∗ λi ) Llama2-7b-chat Vicuna-7b-v1.3,
created vicuna-13b-v1.3, llama-1-7b,
llama-1-13b
12 CLAPS (Zhou et al., Manually Entropy-based score+ Genetic Algorithm: TopK selection Variable O(N ∗ k ∗ |V | ∗ λi ) Flan-T5 Flan-T5 large and base
2023) created Task accuracy Mutation + Crossover
13 Directional-stimulus (Li Manually BLEU, BERTScore RL-trained NN Variable O(λt ) T5, GPT-2 ChatGPT, Codex, InstructGPT
et al., 2023d) created
14 DLN (Sordoni et al., Manually Task accuracy + NLL LLM mutator TopK selection Fixed O(N ∗ k ∗ |Dtrain |) GPT-3 (text- GPT-3 (text-davinci-003), GPT-4
2023) created davinci-003),
GPT-4
15 DSP (Khattab et al., 2022)Instruction Task accuracy Program Synthesis TopK selection Fixed O(N ∗ k ∗ λi ) GPT-3.5 LM: GPT-3.5,
induction Retrieval: ColBERTv2
16 DSPy (Khattab et al., Manually cre- Task accuracy + Program Synthesis TopK selection Variable O(N ∗ k ∗ B ∗ λi )
2024) ated + LLM-feedback
Instruction
Induction
17 GATE (Joko et al., 2024) Manually Human feedback LLM rewriter Open-ended O(N ∗ (λm + GPT-4 GPT-4
created |Dval | ∗ λi ))
18 GPO (Li et al., 2023c) Instruction Task-Accuracy and F1 Metaprompt-design TopK selection O(N ∗ C ∗ |V | ∗ B ∗ gpt-3.5-turbo-0301 gpt-3.5-turbo-0301
induction E)
19 PACE (Dong et al., 2024b) Manually NLL + Task accuracy - LLM-rewriter TopK selection <3 O(N ∗ |ρ| ∗ |Dval |) gpt-3.5-turbo text-davinci-002,
created BLEU and BERTScore (0301) text-davinci-003,
(gpt-3.5-turbo), GPT-4
20 PREFER (Zhang et al., Manually Task accuracy LLM-rewriter + TopK selection Fixed O(N ∗ |ρ| ∗ |Dval |) ChatGPT ChatGPT
2024a) created Ensemble method
21 Promptagent (Wang et al., Manually Task accuracy + LLM rewriter UCT-based bandit-search Fixed O(N ∗ k ∗ λi ) GPT-4 GPT-3.5, GPT-4, PaLM-2
2024a) created LLM-feedback

Table 2: Comparison of all APO techniques based on our framework


SNo. Method Seed instruc- Inference evaluation Candidate generation Search+filter strategy Iteration depth Optimization time Prompt genera- Target models
tions complexity tion model
22 Promptboosting (Hou Instruction- Accuracy, F1 Score Ensemble based Beam-search Early Stopping O(λm ) T5 RoBERTa-large
et al., 2023) induction method
23 Promptbreeder (Fernando Manually LLM Feedback + Genetic Algorithm: Metaheuristic Ensemble Fixed O(ρ ∗ N ∗ |V | ∗ λi ) text-davinci-003, text-davinci-003, PaLM 2-L
et al., 2023) created Task accuracy Mutate + Crossover PaLM 2-L
(LLM-edits)
24 ProTeGi (Pryzant et al., Manually Task accuracy + LLM rewriter UCT-based bandit-search Fixed O(N ∗ C ∗ |Dval | ∗ GPT-3.5-Turbo GPT-3.5-turbo
2023) created LLM-feedback λi )
25 Random separators (Lu Manually Task accuracy LLM-rewriter TopK selection Fixed steps O(N ∗ k ∗ λ) GPT2 Large, GPT2 GPT2 Large, GPT2 XL,
et al., 2024) created XL, Mistral 7B, Mistral 7B Instruct,
Mistral 7B, Mistral Llama-Alpaca 7B, Llama2 7B.
7B Instruct, Llama2 7B Chat, ChatGPT
Llama-Alpaca 7B,
Llama2 7B.
Llama2 7B Chat,
ChatGPT
26 ABO (Yang et al., 2024b)
Manually cre- Task accuracy + LLM-rewriter TopK selection Fixed Steps O(B ∗ N ∗ λi ) GPT-4 GPT-3.5-Turbo, Llama-2-70B-
ated + LLM-feedback chat
Instruction
Induction
27 Adv-ICL (Long et al., Manually LLM Feedback LLM-rewriter Top-1 selection Fixed O(N ∗ k ∗ λi ) text-davinci-002, text-davinci-002, vicuna, Chat-
2024) created vicuna, GPT
ChatGPT
28 AMPO (Yang et al., Manually Task accuracy + Coverage-based TopK selection Variable O(N ∗ C ∗ λi ) GPT-4-turbo GPT-4-turbo
2024d) created F1 score
29 APEER (Jin et al., 2024) Manually created Task accuracy (nDCG) Feedback + preference optimization Used 3 epochs O(N ∗ |ρ| ∗ |Dval|) GPT4, GPT3.5, Llama3, Qwen2
30 APOHF (Lin et al., 2024) Manually Task accuracy + LLM rewriter Linear UCB Fixed O(N ∗ T ) ChatGPT DALLE-3, ChatGPT
created Human feedback
31 BATPrompt (Shi et al., Manually Task accuracy + LLM rewriter TopK selection Fixed O(N ∗|D|∗|ρ|∗λi ) GPT-3.5-turbo GPT-3.5-turbo,
2024) created LLM-feedback GPT-4o-mini, Llama2-7b
32 COPLE (Zhan et al., 2024) Manually created Task accuracy Token edits using MLM (filling masked tokens) Variable O(N ∗ |I| ∗ k ∗ |Dval| ∗ λi) RoBERTa Llama-2-7B-chat, Mistral-7B-Instruct-v0.1, ChatGPT (gpt-3.5-turbo-0125)
33 CRISPO (He et al., 2025) Manually LLM feedback + LLM rewriter TOP-K greedy search Fixed O(N ∗k∗(|Dtrain |∗ Claude Instant, Claude Instant,
created ROUGE-1/2/L F- λi + λm )) Claude 3 Sonnet, Claude 3 Sonnet,
measure, Mistral 7B, Llama3 Mistral 7B, Llama3 8B
AlignScore 8B
34 DAPO (Yang et al., 2024c) Manually Task accuracy LLM-rewriter Top-1 selection Fixed O(N ∗ k ∗ λi ) GPT-3.5-Turbo, GPT-3.5-Turbo,
created Baichuan2, Baichuan2, GPT-4
GPT-4
35 DRPO (Amini et al., Manually Reward model score + LLM rewriter Beam search Fixed O(B ∗ k ∗ N ) Mistral 7b, Mistral Mistral 7b, Mistral 7b (Instruct),
2024) created LLM Feedback 7b (Instruct), Llama 2 70b, Llama 2 70b (chat),
Llama 2 70b, Llama Llama 3 8b, Llama 3 8b (In-
2 70b (chat), struct),
Llama 3 8b, Llama gpt-3.5-turbo
3 8b (Instruct),
gpt-3.5-turbo
36 EVOPROMPT (Guo et al., Manually cre- Task Accuracy + Genetic Algorithm: Metaheuristic ensemble Early Stopping O(N ∗ k ∗ T ∗ λi ) Alpaca-7b, GPT-3.5
2024) ated + ROUGUE+ SARI Mutation operators+
Instruction Crossover
Induction
37 FIPO (Lu et al., 2025) Manually created Task accuracy Finetuned LLMs O(λt + |Dval| ∗ λi) Tulu-13B, Tulu-70B Llama2-7B, Tulu2-13B, Baichuan2-13B

Table 3: Comparison of all APO techniques based on our framework


SNo. Method Seed instruc- Inference evaluation Candidate generation Search+filter strategy Iteration depth Optimization time Prompt genera- Target models
tions complexity tion model
38 LMEA (Liu et al., 2023) Manually Numeric Score-based Genetic Algorithm: TopK selection Fixed O(N ∗ k ∗ λi ) GPT-3.5-turbo-0613
created Mutate + Crossover
(LLM-edits)
39 MIPRO (Opsahl-Ong Manually Task accuracy Program Synthesis TopK selection Fixed O(N ∗ |Dval | ∗ k ∗ GPT-3.5 (proposer Llama-3-8B (task LM)
et al., 2024) created λi ) LM)
40 MOP (Wang et al., 2025) Instruction Task Accuracy APE for each cluster TopK selection Fixed steps per- O(C ∗ N ∗ |Dval |) GPT-3.5-Turbo GPT-3.5-Turbo
induction cluster
41 MORL-Prompt (Jafari Manually Task accuracy + RL-based trained NN Fixed O(N ∗ C ∗ |V | ∗ k) distilGPT-2 GPT-2 (style transfer),
et al., 2024) created Reward score flan-T5-small (translation)
42 OIRL (Sun et al., 2024a) Manually Task accuracy + LLM rewriter O(|Dtrain |∗ρ∗λi + GPT4 Llama2-7B-chat,
created Reward model score λt + |Dval | ∗ λi ) Tigerbot-13B-chat, gpt3.5-turbo
43 OPRO (Yang et al., 2024a) Manually Task accuracy + Metaprompt design TopK selection Variable O(N ∗ k ∗ λi ) PaLM 2-L, text- PaLM family models
created LLM-feedback bison,
gpt-3.5-turbo and
GPT-4
44 PE2 (Ye et al., 2024) Manually cre- Task accuracy + Metaprompt design TopK selection Fixed O(N ∗ k ∗ λi ) GPT-4 text-davinci-003
ated + LLM-feedback
Instruction
Induction
45 PIN (Choi et al., 2024) Manually Task accuracy RL-trained LLM TopK selection Fixed O(N ∗ |V | ∗ λi ∗ C) OPT RoBERTa-large (classification),
created OPT models (others)
46 PLUM (Pan et al., 2024) Manually Task accuracy Genetic Algorithm: Metaheuristics Fixed steps O(N ∗ C ∗ k ∗ λi ) GPT-3-babbage GPT-3-babbage
created Mutate + crossover
47 PRewrite (Kong et al., Manually Task accuracy + RL-trained LLM TopK selection Fixed O(N ∗ C ∗ λi ∗ |V |) PaLM 2-S PaLM 2-L
2024) created Reward model score
48 PROMPTWIZARD Manually Task accuracy + Genetic Algorithm: TopK selection Fixed O(N ∗ C ∗ λi ) GPT3.5/GPT4 GPT3.5/GPT4/Llama-70B
(Agarwal et al., 2024) created LLM-feedback Mutate + Crossover
(LLM-edits)
49 PROMST (Chen et al., Manually Task accuracy + LLM rewriter TopK selection Fixed O(N ∗ k ∗ λi ) GPT-4 GPT-3.5, GPT-4
2024) created Human feedback
50 Reprompting (Xu et al., 2024) LLM-generated CoT process Task accuracy LLM-rewriter Rejection sampling with exploration Fixed or until convergence O(N ∗ k ∗ |ρ|) gpt-3.5-turbo, text-davinci-003 gpt-3.5-turbo, text-davinci-003
51 SAMMO (Schnabel and Manually Task accuracy Program synthesis TopK selection Fixed O(N ∗ k ∗ λi ) Mixtral7x8B, Llama-2 70B,
Neville, 2024) created GPT3.5, GPT4
52 SCULPT (Kumar et al., 2024) Instruction induction on task-README Task accuracy + LLM-feedback LLM-rewriter UCB bandit search Fixed O(N ∗ k ∗ |ρ| ∗ |Dval|) GPT-4o GPT-4o and Llama3.1-8B
53 SOS (Sinha et al., 2024) Manually Task accuracy + LLM-mutator TopK selection Fixed O(N ∗ C ∗ k ∗ λi ) GPT-3.5-turbo, GPT-3.5-turbo, Llama3-8B,
created LLM-feedback Llama3-8B, Mistral-7B
Mistral-7B
54 SPRIG (Zhang et al., Manually Task accuracy Genetic Algorithm: Beam-search Fixed O(N ∗B∗T ∗k∗λi ) tuner007/pegasus_paraphrase
Llama 3.1-8B Instruct,
2024b) created Mutate + Crossover (to- Mistral Nemo Instruct 2407,
kens) Qwen 2.5-7B Instruct,
Llama 70B, Qwen 2.5-72B,
Mistral Large 2407.
55 StraGo (Wu et al., 2024) Manually Task accuracy + Genetic Algorithm: Bandit Search (UCB) Early Stopping O(N ∗ k ∗ T ∗ λi ) GPT-4 GPT-3.5-turbo or GPT-4
created LLM-feedback Mutate + CrossOver
(tokens)
56 TextGrad (Yuksekgonul Manually Task accuracy + LLM rewriter Variable O(N ∗ |Dval | ∗ λi ) GPT-3.5, GPT-4o
et al., 2024) created LLM-feedback
57 UNIPROMPT (Juneja Manually cre- Task accuracy + LLM-rewriter Beam Search Early Stopping O(N ∗ k ∗ λi ) Fine-tuned Llama2- GPT-3.5
et al., 2024) ated + LLM-feedback 13B
Instruction
Induction

Table 4: Comparison of all APO techniques based on our framework


13.2 Evaluation tasks and datasets
Below we describe the different datasets and tasks that each method was evaluated on.
SNo. Paper Tasks
1 GPS (Xu et al., 2022) 10 unseen tasks from the T0 benchmark, which span:
1. Natural Language Inference: ANLI R1, R2, R3, CB, RTE (Nie et al., 2019; Dagan et al.,
2005).
2. Coreference Resolution: WSC, Winogrande.(Levesque et al., 2011)
3. Sentence Completion: COPA(Roemmele et al., 2011) , HellaSwag (Zellers et al., 2019).
4. Word Sense Disambiguation: WiC (Pilehvar and Camacho-Collados, 2019).
2 GRIPS (Prasad et al., 2023) 8 classification tasks from NaturalInstructions (Mishra et al., 2021)
3 Instruction induction (Honovich et al., 2022) 1. Spelling, 2. Syntax, 3. Morpho-syntax, 4. Lexical semantics,
5. Phonetics, 6. Knowledge, 7. Semantics, 8. Style
4 RLPrompt (Deng et al., 2022) 1. Classification
2. Text-style transfer
5 TEMPERA (Zhang et al., 2022) Classification
6 AELP (Hsieh et al., 2024) Big Bench Hard (Suzgun et al., 2023)
7 APE (Zhou et al., 2022) 1. 24 Instruction induction tasks (Honovich et al., 2022) 2. 21 BIG Bench Hard tasks (Suzgun
et al., 2023)
8 AutoHint (Sun et al., 2023) BIG-Bench Instruction Induction (Epistemic Reasoning, Logical Fallacy Detection, Implica-
tures, Hyperbaton, Causal Judgment, Winowhy) (Zhou et al., 2022)
9 BDPL (Diao et al., 2022) 1. MNLI (Williams et al., 2017), 2. QQP (Cer et al., 2017), 3. SST-2 (Socher et al., 2013), 4.
MRPC (Dolan and Brockett, 2005), 5. CoLA (Warstadt et al., 2018), 6. QNLI (Rajpurkar et al.,
2016), 7. RTE (Dagan et al., 2005), 8. CitationIntent (Jurgens et al., 2018), 9. SciERC (Luan
et al., 2018), 10. RCT (Dernoncourt and Lee, 2017), 11. HyperPartisan (Kiesel et al., 2019)
10 Boosted Prompting (Pitis et al., 2023) GSM8K (Cobbe et al., 2021) and AQuA (Garcia et al., 2020)
11 BPO (Cheng et al., 2024) Generation: Dolly Eval (Conover et al., 2023), Vicuna Eval (Chiang et al., 2023), Self-Instruct
Eval (Wang et al., 2022b)
12 CLAPS (Zhou et al., 2023)
13 Directional-stimulus (Li et al., 2023d) MultiWOZ (Budzianowski et al., 2018)
14 DLN (Sordoni et al., 2023) 1. Mpqa Sentiment analysis (Lu et al., 2021)
2. Trec Question type classification (Lu et al., 2021)
3. Subj Determine whether a sentence is subjective or objective (Lu et al., 2021)
4. Leopard (Bansal et al., 2019)- Disaster Determine whether a sentence is relevant to a disaster.
5. Leopard (Bansal et al., 2019)- Airline Airline tweet sentiment analysis.
6. BBH (Suzgun et al., 2023)- (Hyper, Nav, Date, Logic datasets)
15 DSP (Khattab et al., 2022) 1. open-domain question answering (Open-SQuAD) (Lee et al., 2019)
2. multi-hop question answering (HotPotQA) (Yang et al., 2018)
3. conversational question answering (QReCC) (Anantha et al., 2020)
16 DSPy (Khattab et al., 2024)
17 GATE (Joko et al., 2024) LAPS (Joko et al., 2024) (1. Content Recommendation (user likes to read a given held-out
article or not) 2. Moral Reasoning, 3. Email Verification)
18 GPO (Li et al., 2023c) 1. Sentiment analysis - Yelp (Zhang et al., 2015), Flipkart (Vaghani and Thummar, 2023),
IMDB (Maas et al., 2011), Amazon (Zhang et al., 2015)
2. NLI - MNLI (Williams et al., 2017), ANLI (Nie et al., 2019) 3.Entailment - RTE (Dagan
et al., 2005), 4. CommonsenseQA - SocialIQA (Sap et al., 2019)
5. Multi-turn dialog - DSTC7 (Gunasekara et al., 2019), Ubuntu Dialog (Lowe et al., 2015),
MuTual (Cui et al., 2020)
6. NumericalQA - DROP (Dua et al., 2019)
19 PACE (Dong et al., 2024b) BBH (Suzgun et al., 2023), instruction induction tasks (24 tasks) (Honovich et al., 2022) and
translation tasks (en-de, en-es, en-fr)
20 PREFER (Zhang et al., 2024a) 1. NLI tasks including SNLI (Bowman et al., 2015), MNLI (Williams et al., 2017), QNLI
(Rajpurkar et al., 2016), RTE (Dagan et al., 2005)
2. Classification: Ethos (Mollas et al., 2020), liar (Wang, 2017), ArSarcasm (Farha and Magdy,
2020a)
21 Promptagent (Wang et al., 2024a) 1. BigBenchHard (BBH) (Suzgun et al., 2023) - 6 BBH tasks that emphasize a blend of domain
knowledge
2. Biomedical - Disease NER (NCBI) (Doğan et al., 2014), MedQA (Jin et al., 2020), Bio
similar sentences (Sogancioglu et al., 2017)
3. 2 classification - TREC (Voorhees and Tice, 2000) + Subj. (Pang and Lee, 2004) 1 NLI(CB)
(de Marneffe et al., 2019)
22 Promptboosting (Hou et al., 2023) Text Classification
23 Promptbreeder (Fernando et al., 2023) 1. Arithmetic Reasoning: Benchmarks: GSM8K (Cobbe et al., 2021), MultiArith (Roy and
Roth, 2016), AddSub (Hosseini et al., 2014),
SVAMP (Patel et al., 2021), SingleEq (Koncel-Kedziorski et al., 2015), AQuA-RAT (Ling et al.,
2017).
2. Commonsense Reasoning: Benchmarks: CommonSenseQA (CSQA) (Talmor et al., 2019),
StrategyQA (SQA) (Geva et al., 2021).
3. Hate Speech Classification: Dataset: ETHOS (Mollas et al., 2020).
4. Instruction Induction (Honovich et al., 2022): Tasks: 24 datasets spanning
sentence similarity, style transfer, sentiment analysis, and more

Table 5: Tasks covered in the different papers


SNo. Paper Tasks
24 ProTeGi (Pryzant et al., 2023) Jailbreak (Pryzant et al., 2023), Liar (Wang, 2017), Sarcasm (Farha and Magdy, 2020b), Ethos
(Mollas et al., 2020)
25 Random separators (Lu et al., 2024) 1. SST-2, 2. SST-5 (Socher et al., 2013), 3. DBPedia (Zhang et al., 2015), 4. MR (Pang and Lee,
2005), 5. CR (Hu and Liu, 2004), 6. MPQA (Wiebe et al., 2005), 7. Subj (Pang and Lee, 2004),
8. TREC (Voorhees and Tice, 2000), 9. AGNews (Zhang et al., 2015)
26 ABO (Yang et al., 2024b) BigBenchHard tasks (Suzgun et al., 2023): Object Counting, Navigate, Snarks, Question
Selection
27 Adv-ICL (Long et al., 2024) Summarization (XSUM (Narayan et al., 2018), CNN/Daily Mail (Nallapati et al., 2016)),
Data-to-Text (WebNLG (Gardent et al., 2017), E2E NLG (Novikova et al., 2017)), Translation
(LIRO (Dumitrescu et al., 2021), TED Talks (Qi et al., 2018)), Classification (YELP-5 (Zhang
et al., 2015), WSC (Levesque et al., 2011)), Reasoning (GSM8k (Cobbe et al., 2021), SVAMP
(Patel et al., 2021))
28 AMPO (Yang et al., 2024d) Text classification task TREC (Voorhees and Tice, 2000),
sentiment classification task SST-5 (Socher et al., 2013),
largescale reading comprehension task RACE (Lai et al., 2017),
medical question-answering tasks MedQA (Jin et al., 2020) and MedMCQA (Pal et al., 2022)
29 APEER (Jin et al., 2024) Passage reranking
30 APOHF (Lin et al., 2024) 1. User instruction optimization using tasks from Instructzero, 2. Text-to-image , 3. Response
optimization
31 BATPrompt (Shi et al., 2024) 1. Language understanding, 2. Text summarization, 3. Text simplification
32 COPLE (Zhan et al., 2024) GLUE - SST2 (Socher et al., 2013), COLA (Warstadt et al., 2018), MNLI (Williams et al.,
2017), QNLI (Rajpurkar et al., 2016), RTE (Dagan et al., 2005), MRPC (Dolan and Brockett,
2005), QQP (Cer et al., 2017) MMLU (Hendrycks et al., 2020) - STEM, Humanities, Social
Sciences and Other
33 CRISPO (He et al., 2025) Summarization, QA
34 DAPO (Yang et al., 2024c) 1. Sentiment classification, 2. topic classification, 3. News, 4. TREC (Voorhees and Tice,
2000), 5. subjectivity classification (Pang and Lee, 2004), 6. Logic Five, 7. Hyperbaton, 8.
Disambiguation, 9. Salient, 10.Translation
35 DRPO (Amini et al., 2024) Alignment benchmark
36 EVOPROMPT (Guo et al., 2024) 1. Language Understanding: Sentiment classification (e.g., SST-2, SST-5, CR, MR (Socher
et al., 2013; Hu and Liu, 2004; Pang and Lee, 2005)), 2. Topic classification (e.g., AGNews
(Zhang et al., 2015), TREC (Voorhees and Tice, 2000)), Subjectivity classification (Subj (Pang
and Lee, 2004)). 3. Language Generation: Summarization (SAMSum (Gliwa et al., 2019)).
Simplification (ASSET (Alva-Manchego et al., 2020)). 4. Reasoning (BIG-Bench Hard Tasks)
(Suzgun et al., 2023): Multi-step reasoning tasks from BBH, such as logical deduction, causal
judgment, and object tracking.
37 FIPO (Lu et al., 2025) 1. Generation: GSM8K (Cobbe et al., 2021), BBH (Suzgun et al., 2023) 2. Multiple Choice:
PiQA (Bisk et al., 2019), CosmosQA (Huang et al., 2019), MMLU (Hendrycks et al., 2020)
38 LMEA (Liu et al., 2023) Traveling Salesman Problems (TSPs)
39 MIPRO (Opsahl-Ong et al., 2024) 1. Question Answering (HotPotQA)(Yang et al., 2018) 2. Classification (Iris (Fisher, 1936),
Heart Disease (Detrano et al., 1989)) 3. Entailment (ScoNe) (She et al., 2023) 4. Multi-hop
Fact Extraction and Claim Verification (HoVer) (Jiang et al., 2020)
40 MOP (Wang et al., 2025) 50 tasks comprising of Instruction Induction (Honovich et al., 2022), Super Natural Instructions
(Mishra et al., 2021), BBH (Suzgun et al., 2023)
41 MORL-Prompt (Jafari et al., 2024) 1. Unsupervised Text Style Transfer: Shakespearean data (Xu et al., 2012) 2. Supervised
Machine Translation: iwslt2017 (Cettolo et al., 2017)
42 OIRL (Sun et al., 2024a) Arithmetic reasoning: GSM8K (Cobbe et al., 2021), MAWPS, SVAMP (Patel et al., 2021)
43 OPRO (Yang et al., 2024a) GSM8K (Cobbe et al., 2021), BBH (23 tasks) (Suzgun et al., 2023), MultiArith (Roy and Roth,
2016), AQuA (Garcia et al., 2020)
44 PE2 (Ye et al., 2024) 1. MultiArith and GSM8K for math reasoning (Cobbe et al., 2021),
2. Instruction Induction (Honovich et al., 2022),
3. BIG-bench Hard for challenging LLM tasks (Suzgun et al., 2023)
4. Counterfactual Evaluation
5. Production Prompt
45 PIN (Choi et al., 2024) 1. Classification: SST-2 and etc (Socher et al., 2013)
2. Unsupervised Text Style transfer: Yelp (Zhang et al., 2015)
3.Textual Inversion From Images: MSCOCO (Lin et al., 2014), LAION (Schuhmann et al.,
2022)
46 PLUM (Pan et al., 2024) Natural-Instructions datasets v2.6 (Mishra et al., 2021)
47 PRewrite (Kong et al., 2024) 1. Classification: AG News (Zhang et al., 2015), SST-2 (Socher et al., 2013)
2. Question answering: NQ (Kwiatkowski et al., 2019)
3. Arithmetic reasoning: GSM8K (Cobbe et al., 2021)
48 PROMPTWIZARD (Agarwal et al., 2024) 1. BIG-Bench Instruction Induction (BBII) (Honovich et al., 2022)
2. GSM8k (Cobbe et al., 2021), AQUARAT (Ling et al., 2017), and SVAMP (Patel et al., 2021)
3. BIG-Bench Hard (BBH) (Suzgun et al., 2023)
4. MMLU (Hendrycks et al., 2020), Ethos (Mollas et al., 2020), PubMedQA (Jin et al., 2019),
MedQA (Jin et al., 2020)
49 PROMST (Chen et al., 2024) 11 multistep tasks: 1. Webarena, 2. Alfworld (Shridhar et al., 2020), 3. Scienceworld (Wang
et al., 2022a), 4. BoxNet1 (Nezhadarya et al., 2019), 5. BoxNet2,
6. BoxLift, 7. Warehouse, 8. Gridworld 1, 9. Gridworld 2, 10. Blocksworld, 11. Logistics
50 Reprompting (Xu et al., 2024) BBH (Suzgun et al., 2023), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al.)

Table 6: Tasks covered in the different papers


SNo. Paper Tasks
51 SAMMO (Schnabel and Neville, 2024) 1. BigBench zero-shot classification tasks (Srivastava et al., 2022)
2. GeoQuery (Zelle and Mooney, 1996), SMCalFlow (Andreas et al., 2020), Overnight (Wang
et al., 2015) 3. Super-NaturalInstructions (Mishra et al., 2021)
52 SCULPT (Kumar et al., 2024) BBH (23 tasks) (Suzgun et al., 2023), RAI (Kumar et al., 2024)
53 SOS (Sinha et al., 2024) 1. Sentiment Analysis 2. Orthography Analysis, 3. Taxonomy of Animals, 4. Disambiguation
QA, 5. Logical Five, 6. Color Reasoning
54 SPRIG (Zhang et al., 2024b) 1. Reasoning: Tasks requiring multi-step logic or causal reasoning.
2. Math: Arithmetic and logical deduction problems.
3. Social Understanding: Empathy detection, humor identification, and politeness evaluation.
4. Commonsense: Inference tasks like object counting and temporal reasoning.
5. Faithfulness: Ensuring generated outputs align with input data.
6. Knowledge: Open-domain QA and knowledge recall tasks.
7. Language Understanding: Tasks like sentiment analysis and text classification.
8. Popular benchmarks include MMLU (Hendrycks et al., 2020), BBH (Suzgun et al., 2023),
TruthfulQA (Lin et al., 2022), XCOPA (Ponti et al., 2020), SocKET (Choi et al., 2023), and
others, covering 47 task types across multiple languages and domains.
55 StraGo (Wu et al., 2024) BBH (Suzgun et al., 2023)(five challenging tasks within Big-Bench Hard) 2. SST-5 (Socher
et al., 2013)(fine-grained sentiment classification) 3. TREC (Voorhees and Tice, 2000)(question-
type classification). 4. MedQA (Jin et al., 2020),MedMCQA (Pal et al., 2022) (medical-domain
QA) 5. Personalized Intent Query (an internal industrial scenario)
56 TextGrad (Yuksekgonul et al., 2024) LeetCode Hard (Shinn et al., 2024), Google-proof QA (Rein et al., 2023), MMLU (Hendrycks
et al., 2020) (Machine Learning, College Physics), BBH (Suzgun et al., 2023) (Object Count-
ing, Word Sorting), GSM8k (Cobbe et al., 2021), DOCKSTRING (Garc’ia-Orteg’on et al.,
2021)(molecule evaluation)
57 UNIPROMPT (Juneja et al., 2024) (1) Ethos (Mollas et al., 2020), (2) ARC (Clark et al., 2018) , (3) MedQA (Jin et al., 2020), (4)
GSM8K (Cobbe et al., 2021) and (5) one real-world task: Search Query Intent (Juneja et al.,
2024)

Table 7: Tasks covered in the different papers


14 Prompt examples
14.1 Instruction Induction
Below is the original instruction induction prompt used by Honovich et al. (2023):
{{#system~}}
You are a helpful assistant
{{~/system}}
{{#user~}}
I gave a friend an instruction and [[n_demo]] inputs. The friend read the instruction and wrote an output for every one of the inputs. Here are the input-output pairs:
{{demos}}
What was the instruction? It has to be less than {{max_tokens}} tokens.
{{~/user}}
{{#assistant~}}
The instruction was {{gen 'instruction' [[GENERATION_CONFIG]]}}
{{~/assistant}}
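For illustration, here is a small Python helper (not from the original paper) that renders the same template into a chat-style message list; demos, n_demo, and max_tokens mirror the placeholders above.

def build_induction_prompt(demos, n_demo, max_tokens=50):
    # Fill the instruction-induction template with input-output demonstrations.
    demo_block = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos[:n_demo])
    user = (
        f"I gave a friend an instruction and {n_demo} inputs. "
        "The friend read the instruction and wrote an output for every one of the inputs. "
        f"Here are the input-output pairs:\n\n{demo_block}\n\n"
        f"What was the instruction? It has to be less than {max_tokens} tokens."
    )
    return [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": user},
        # Pre-filled assistant turn steers the model to complete the instruction.
        {"role": "assistant", "content": "The instruction was"},
    ]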

14.2 Metaprompt design example


Below is the metaprompt used in OPRO (Yang et al., 2024a):
I have some texts along with their corresponding scores. The texts are arranged in ascending order based on their scores, where higher scores indicate better quality.
text: Let's figure it out!
score: 61
text: Let's solve the problem.
score: 63
(. . . more instructions and scores . . . )
The following exemplars show how to apply your text: you replace <INS> in each input with your text, then read the input and give an output. We say your output is wrong if your output is different from the given output, and we say your output is correct if they are the same.
input: Q: Alannah, Beatrix, and Queen are preparing for the new school year and have been given books by their parents. Alannah has 20 more books than Beatrix. Queen has 1/5 times more books than Alannah. If Beatrix has 30 books, how many books do the three have together?
A: <INS>
output: 140
(. . . more exemplars . . . )
Write your new text that is different from the old ones and has a score as high as possible. Write the text in square brackets.
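A hedged sketch of assembling such a metaprompt programmatically follows; the function and argument names are illustrative, while the instruction placeholder <INS> and the overall wording follow the OPRO metaprompt above.

def build_opro_metaprompt(scored_texts, exemplars):
    # Assemble an OPRO-style metaprompt: prior instructions sorted by
    # ascending score, followed by exemplars and the generation request.
    lines = ["I have some texts along with their corresponding scores. "
             "The texts are arranged in ascending order based on their scores, "
             "where higher scores indicate better quality."]
    for text, score in sorted(scored_texts, key=lambda ts: ts[1]):
        lines.append(f"text:\n{text}")
        lines.append(f"score: {score}")
    lines.append("The following exemplars show how to apply your text: "
                 "you replace <INS> in each input with your text, then read the "
                 "input and give an output.")
    for question, answer in exemplars:
        lines.append(f"input: {question}\noutput: {answer}")
    lines.append("Write your new text that is different from the old ones and "
                 "has a score as high as possible. Write the text in square brackets.")
    return "\n".join(lines)

Sorting in ascending order places the best-scoring instructions nearest the generation request, which OPRO reports helps the optimizer LLM propose improvements.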

14.3 LLM Feedback prompts


Table 8: Automatic prompt optimization for LLM-as-a-Judge methods, text gradients (Pryzant et al., 2023; Wang et al., 2024a) and PE2 (Ye et al., 2024).

Method: Text-gradients (Pryzant et al., 2023)
LLMaaJ prompt: "I'm trying to write a zero-shot classifier prompt. My current prompt is: "{prompt}" But this prompt gets the following examples wrong: {error_string} give {num_feedbacks} reasons why the prompt could have gotten these examples wrong. Wrap each reason with <START> and <END>"
Candidate prompt: "Determine whether the Statement is a lie (Yes) or not (No) based on the Context and other information. Statement: Small businesses (are) going out of business in record numbers. Job title: Senator. State: Texas. Party: republican. Context: a speech at Liberty University" Label: Yes Prediction: No
Response: N/A
Subject of evaluation (prompt / response / both): Prompt
Evaluation output: "The prompt does not take into account the speaker's potential biases or agenda, which could influence the veracity of their statements."
Rewritten prompt: "Determine if the statement is true (Yes) or false (No) based on the context, sources referenced, and potential biases of the speaker."

Method: Text-gradients (Wang et al., 2024a)
LLMaaJ prompt: "I'm writing prompts for a language model designed for a task. My current prompt is: {cur prompt} But this prompt gets the following examples wrong: {error string} For each wrong example, carefully examine each question and wrong answer step by step, provide comprehensive and different reasons why the prompt leads to the wrong answer. At last, based on all these reasons, summarize and list all the aspects that can improve the prompt."
Candidate prompt: "Premise: William learns that kids play in water coming up in streams out of a tiled floor with image of a large rose on it. Hypothesis: William learns that kids are playing in water. Label: Non-entailment Prediction: Entailment"
Response: Non-entailment
Subject of evaluation (prompt / response / both): Prompt
Evaluation output: "Error Feedback: "Ignoring context and detail" The model might be overlooking the details of the premise 'kids play in water coming up in streams out of a tiled floor with an image of a large rose on it,' which directly implies the hypothesis."
Rewritten prompt: "Compare the provided sentences. Take into account the subtleties in the context, pinpoint the order of events and differentiate between facts and assumptions. If the hypothesis is a direct result of the premise, select 'entailment'."

Method: PE2 (Ye et al., 2024)
LLMaaJ prompt: "# Current Prompt Let's think step by step. # Instruction For each example, provide reasoning according to the following template: * Output is correct? * Necessary to edit the prompt? * If yes, suggestions on prompt editing? ..."
Candidate prompt: "Let's think step by step. # Full Template ''' Question: Answer: Let's think step by step. ''' # Examples ## Example 1 Input: George had 28 socks. If he threw away 4 socks ... Output: 64 Reasoning: Step 1: George had 28 socks. Step 2: ... Label: 60 [More examples ...]"
Response: N/A
Subject of evaluation (prompt / response / both): Both
Evaluation output: "## Example 1 Output is correct? No. Reasoning: the model didn't subtract the socks he threw away. ## Prompt describing the task correctly? Yes. Necessary to edit the prompt? Yes. Suggestions: The prompt should be edited to guide the model to perform subtraction. [More examples ...]"
Rewritten prompt: "Now carefully review your reasoning and proceed with step 2: refine the prompt. # Current Prompt Let's think step by step. # Instructions * The total length should be less than 50 words * Reply with the prompt. Do not include other text."
Table 9: Automatic prompt optimization for LLM-as-a-Judge methods, Hints (Sun et al., 2023).

Method: Hints (Sun et al., 2023)
LLMaaJ prompt: "Given following task: [Task Description] Given Input: [Input] And its expected Output: [output] List the reason or hint why it's with this expected output within tag <hint> and </hint>."
Candidate prompt: "Determine whether one sentence entails the next # Given Input: [input] Identify the relation between the following premises and hypotheses, choosing from the options 'entailment' or 'non-entailment'. Put your answer within tag <Ans> and </Ans>. # Result"
Response: Non-entailment
Subject of evaluation (prompt / response / both): Prompt
Evaluation output: "- Entailment occurs when the hypothesis is a logical consequence of the premise, or when the premise guarantees the truth of the hypothesis, regardless of the level of specificity or simplification of the terms involved. - Non-entailment occurs when the premise does not guarantee the truth of the hypothesis, or when there is a possibility that the hypothesis is false or unknown, especially when the premise involves beliefs or thoughts of other people."
Rewritten prompt: "Determine whether one sentence entails the next. Some useful hints are: - Entailment occurs when the hypothesis is a logical consequence of the premise, or when the premise guarantees the truth of the hypothesis, regardless of the level of specificity or simplification of the terms involved. - Non-entailment occurs when the premise does not guarantee the truth of the hypothesis, or when there is a possibility that the hypothesis is false or unknown, especially when the premise involves beliefs or thoughts of other people. # Given Input: [input] Identify the relation between the following premises and hypotheses, choosing from the options 'entailment' or 'non-entailment'. Put your answer within tag <Ans> and </Ans>. # Result"
Table 10: Automatic prompt optimization for LLM-as-a-Judge methods, Critique (He et al., 2025).

Method: Critique (He et al., 2025)

LLMaaJ prompt:
"Critique:
- Number of words: The predicted summaries tended to be longer with more details while the reference summaries were shorter and more concise.
- Number of sentences: The predicted summaries used more sentences to describe the inputs while the reference summaries were more succinct with fewer sentences.
- Precision: Some details in the predicted summaries were not important and not mentioned in the reference summaries.
- Recall: Some key details highlighted in the reference summaries were missing from the predicted summaries.
Suggestion:
- Specifying the expected length of the summary (e.g. 1-2 sentences)
- Emphasizing to only include the most important/key details
- Indicating which details should be prioritized or omitted"

Candidate prompt:
"For the given text, write a 1-2 sentence summary within 〈summary〉 tags that highlights the most important details. Focus on including who the key people are and what happened between them.
INSERT INPUT HERE
Some key details to focus on include the main characters, any plans or arrangements that were made, and the overall outcome or resolution."
Score: 42.1

Response:
"Tegan tells Valentia that Paul's brother sent her a friend request on social media, though she and Paul had previously broken up. Valentia advises Tegan to ignore the request, not wanting Tegan to revisit her past relationship with Paul."

Subject of evaluation (prompt / response / both): both

Evaluation output:
"Comparing the high-score and low-score instructions, here are some suggestions that could improve them:
〈suggestion〉Specify the desired length or range for the summaries (e.g., 10 words and 1-2 sentences).〈/suggestion〉
〈suggestion〉Specify to focus on key events and specify which details〈/suggestion〉
〈suggestion〉Specify the output should not contain unnecessary context〈/suggestion〉"

Rewritten prompt:
"Read the dialogue provided in INSERT INPUT HERE and identify the key events between characters and outcomes. Then write a 1-2 sentence summary within 〈summary〉 tags that concisely captures these important plot points, such as who will borrow a dress or who has an interview, while keeping within 10 words where possible. Focus only on the characters and salient events, omitting unnecessary context."
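The critique-then-rewrite flow illustrated above can be approximated as follows. This is a minimal sketch assuming a generic `call_llm` helper and prediction/reference pairs; the exact multi-aspect meta-prompts in He et al. (2025) differ, so only the Critique/Suggestion structure is mirrored here.

```python
from typing import Callable, List, Tuple

def critique_then_rewrite(call_llm: Callable[[str], str],
                          candidate_prompt: str,
                          pairs: List[Tuple[str, str]]) -> str:
    """Multi-aspect critique of predictions vs. references, then a rewrite."""
    comparisons = "\n".join(
        f"Predicted: {pred}\nReference: {ref}" for pred, ref in pairs
    )
    # The judge compares outputs along several aspects and emits
    # Critique / Suggestion sections.
    feedback = call_llm(
        "Compare the predicted and reference outputs along aspects such as "
        "length, precision, and recall. Write a 'Critique:' section and a "
        f"'Suggestion:' section.\n{comparisons}"
    )
    # The rewriter folds the suggestions back into the candidate prompt.
    return call_llm(
        f"Current prompt: {candidate_prompt}\n{feedback}\n"
        "Incorporate the suggestions and reply with the rewritten prompt only."
    )
```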
Table 11: Automatic prompt optimization for LLM-as-a-Judge methods, Reflection (Cieri et al., 2022).
Method: Reflection (Cieri et al., 2022)

LLMaaJ prompt:
"Here is a conversation with an LLM: {x|y}.
Below are the criticisms on {y}:
Explain how to improve {x}."

Candidate prompt:
"Below are the criticisms on {x}:
Incorporate the criticisms, and produce a new variable."

Response: N/A

Subject of evaluation (prompt / response / both): both

Evaluation output:
"Example output for instance optimization (a specific coding problem, for example): Handling `nums[i] == k`: The current logic does not correctly handle the case when `nums[i] == k`. The balance should be reset or adjusted differently when `k` is encountered.
Output for prompt optimization: the evaluator LLM simply returns whether the generated output and the ground truth match (a math problem result, for example)."

Rewritten prompt:
"For prompt optimization:
From: You will answer a reasoning question. Think step by step. The last line of your response should be of the following format: 'Answer: $VALUE' where VALUE is a numerical value.
To: You will answer a reasoning question. List each item and its quantity in a clear and consistent format, such as '- Item: Quantity'. Sum the values directly from the list and provide a concise summation. Ensure the final answer is clearly indicated in the format: 'Answer: $VALUE' where VALUE is a numerical value. Verify the relevance of each item to the context of the query and handle potential errors or ambiguities in the input. Double-check the final count to ensure accuracy."
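The two meta-prompts above chain into a reflect-then-update step, sketched below. The `call_llm` helper, the `{x|y}` string rendering, and the function name are assumptions for illustration only; the prompt wording follows the table.

```python
from typing import Callable

def reflect_and_update(call_llm: Callable[[str], str],
                       x: str, y: str, criticism_on_y: str) -> str:
    """Propagate a criticism on output y back into an update of variable x."""
    # "Gradient" step: turn criticisms on the output into criticisms on x.
    criticism_on_x = call_llm(
        f"Here is a conversation with an LLM: {x} | {y}.\n"
        f"Below are the criticisms on y:\n{criticism_on_y}\n"
        "Explain how to improve x."
    )
    # "Update" step: incorporate the criticisms into a new version of x.
    return call_llm(
        f"Below are the criticisms on x:\n{criticism_on_x}\n"
        "Incorporate the criticisms, and produce a new variable."
    )
```

The same pair of calls covers both use cases in the table: for instance optimization, x is a specific solution (e.g., code) and the criticism is a test-case analysis; for prompt optimization, x is the prompt itself and the criticism is derived from output-versus-ground-truth matching.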
Table 12: Automatic prompt optimization for LLM-as-a-Judge methods, Safety Score (Sinha et al., 2024).
Method: Safety Score (Sinha et al., 2024)

LLMaaJ prompt:
"You are a security gatekeeper. Given an existing prompt and a series of cases where it made mistakes, look through each case carefully and identify what is causing the mistakes. Based on these observations, output a way to improve the prompts based on the mistakes. From a security perspective, a mistake was made when the LLM helped the user perform a malicious task. If the LLM responds with a "canned response", like, "I am just an AI, I cannot help you with this...", then the LLM did not help the user. If the LLM says it cannot answer without more information, then the LLM did not help.
Some ways to make a prompt secure are:
1. Add a detailed description of the task and how to solve it. Consider providing a step-by-step description of the process the LLM should follow.
2. Specify instructions to ensure the LLM's answer is respectful.
3. Specify guardrails around input and output format.
*{passed in data}*
Ways to make the existing prompt more secure based on observations of the mistakes in the cases above are:"

Candidate prompt:
"You are a mutator. Given a prompt, your task is to generate another prompt with the same semantic meaning and intentions.
# Example:
current prompt: Your mission is to ascertain the sentiment of the provided text and assign a sentiment label from ['negative', 'positive'].
mutated prompt: Determine the sentiment of the given sentence and assign a label from ['negative', 'positive'].
# Given: {passed in data}"

Response: KPI score; Security score

Subject of evaluation (prompt / response / both): Response

Evaluation output:
"Select the locally optimal with respect to an objective if it achieves the best performance on this objective among all prompts that exhibit similar performance across all other objectives."

Rewritten prompt:
"Task Description: You are a Sentiment Classifier. Your task is to classify the emotional sentiment of the text to understand the overall tone and sentiment expressed. This sentiment classification can be used to analyze customer feedback, improve product experiences, or identify potential issues.
Instructions for Respectful Language Model: It is important to ensure that the language model promotes positive and respectful interactions. Therefore, please note that the language model should not provide any support or guidance for illegal activities, harmful actions, or unethical behavior. Emphasize the importance of promoting positive and respectful interactions.
Input and Output Format: The input should only consist of a single sentence or a short paragraph that requires sentiment classification. The output should provide the emotional sentiment without any additional information or suggestions.
Examples:
Please classify the emotional sentiment expressed in the following texts:
Input: "The new software update is user-friendly and efficient."
Correct Answer: ['positive']
{Another four few shot examples}
Please provide the emotional sentiment for each input text without any additional information or suggestions."
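The mutate-score-select loop behind this row can be sketched as follows. Everything here is an illustrative assumption: `call_llm` is a generic completion helper, `score` is a user-supplied function returning (KPI score, security score), and the `tolerance` threshold stands in for whatever notion of "similar performance" Sinha et al. (2024) use in their selection rule.

```python
from typing import Callable, List, Tuple

def select_secure_prompt(call_llm: Callable[[str], str],
                         score: Callable[[str], Tuple[float, float]],
                         seed_prompt: str,
                         n_mutants: int = 4,
                         tolerance: float = 1.0) -> str:
    """Mutate a prompt, score each variant on (KPI, security), and keep a
    locally optimal one."""
    candidates = [seed_prompt]
    for _ in range(n_mutants):
        candidates.append(call_llm(
            "You are a mutator. Given a prompt, your task is to generate "
            "another prompt with the same semantic meaning and intentions.\n"
            f"# Given: {seed_prompt}"
        ))
    scored = [(p,) + score(p) for p in candidates]  # (prompt, kpi, security)
    best_kpi = max(kpi for _, kpi, _ in scored)
    # Among prompts whose KPI is within `tolerance` of the best, prefer the
    # highest security score: locally optimal on security given that the
    # candidates perform similarly on the other objective.
    near_best = [t for t in scored if best_kpi - t[1] <= tolerance]
    return max(near_best, key=lambda t: t[2])[0]
```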