A Systematic Survey of Automatic Prompt Optimization Techniques
Kiran Ramnath, Kang Zhou, Sheng Guan, Soumya Smruti Mishra, Xuan Qi, Zhengyuan Shen,
Shuai Wang, Sangmin Woo, Sullam Jeoung, Yawei Wang, Haozhu Wang, Han Ding,
Yuzhe Lu, Zhichao Xu, Yun Zhou, Balasubramaniam Srinivasan, Qiaojing Yan, Yueyan Chen,
Haibo Ding, Panpan Xu, and Lin Lee Cheong
Amazon Web Services
{raxkiran, zhoukang, shguan, soumish, xuaqi, donshen, wshui, sangminw, sullamij, yawenwan, haozhuw, handing, yuzhelu, xzhichao, yunzzhou, srbalasu, qiaojiny, yyanc, hbding, xupanpan, lcheong}@amazon.com
Abstract

Prompt engineering is a crucial step for eliciting desired responses for various Natural Language Processing (NLP) tasks. However, prompt engineering remains an impediment for end users due to rapid advances in models, tasks, and associated best practices. To mitigate this, Automatic Prompt Optimization (APO) techniques have recently emerged that use various automated techniques to help improve the performance of LLMs on various tasks. In this paper, we present a comprehensive survey summarizing the current progress and remaining challenges in this field. We provide a formal definition of APO, a 5-part unifying framework, and then proceed to rigorously categorize all relevant works based on their salient features therein. We hope to spur further research guided by our framework.

1 Introduction

Since McCann et al. (2018) cast multi-task NLP as Question Answering, using prompts as inputs has become the standard way to elicit desired responses from Large Language Models (LLMs). Furthermore, LLMs' few-shot learning (Brown et al., 2020), instruction-following (Ouyang et al., 2022), and zero-shot reasoning capabilities (Kojima et al., 2023) have led to a widespread proliferation of prompting tricks for various tasks and model variants. However, LLMs still exhibit unpredictable sensitivity to various factors (explanation of the task (Li et al., 2023b), ordering (Liu et al., 2024a), stylistic formatting (Sclar et al.), etc.), causing a performance gap between two prompts that are semantically similar and thereby adding impediments to adoption by end users. Against this backdrop, Black-Box Automatic Prompt Optimization (APO) techniques have emerged that improve task performance via automated prompt improvements. They possess various attractive features - (1) they do not require access to model weights or gradients, making them applicable to black-box LLMs. In this survey paper, we aim to highlight the advances in the field. Our core contribution is a 5-part APO taxonomy combined with a comprehensive fine-grained categorization of the various design choices therein (see Fig. 1 and Tables 2, 3, 4 in the Appendix). We hope our framework will be informative for new and seasoned researchers alike, enabling further research on open questions.

2 Automatic Prompt Optimization Formulation

We formalize the process of automatic prompt optimization (APO) as follows. Given a task model M_task and an initial prompt ρ ∈ V*, the goal of an APO system M_APO is to obtain the best-performing prompt template ρ_opt under a metric f ∈ F and eval-set D_val:

    ρ_opt := arg max_{ρ ∈ V*} E_{x ∼ D_val} [ f(M_task(ρ ⊕ x)) ]    (1)

This objective is not tractable for discrete prompt optimization, as token-sequence search spaces are combinatorial. Instead, APO techniques follow the general anatomy described in Algorithm 1 to obtain approximate solutions.
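To make this anatomy concrete, the sketch below mirrors the generic seed → generate → evaluate → filter loop; it is an illustrative sketch only, and `propose_candidates`, `task_model`, and `metric` are hypothetical stand-ins for the concrete design choices surveyed in the following sections.

```python
import random

def optimize_prompt(seed_prompts, propose_candidates, task_model, metric,
                    eval_set, num_iters=10, top_k=4):
    """Generic APO anatomy: iteratively generate, evaluate, and filter candidates.

    seed_prompts       -- initial prompt strings (Sec. 3)
    propose_candidates -- assumed fn: list[str] -> list[str] (Sec. 5)
    task_model         -- assumed fn: (prompt, x) -> prediction (M_task)
    metric             -- assumed fn: (prediction, y) -> float (f in F)
    eval_set           -- list of (x, y) pairs (D_val)
    """
    def score(prompt):
        # Approximate E_{x ~ D_val}[f(M_task(p + x))] on a random minibatch.
        batch = random.sample(eval_set, min(16, len(eval_set)))
        return sum(metric(task_model(prompt, x), y) for x, y in batch) / len(batch)

    pool = list(seed_prompts)
    for _ in range(num_iters):                              # iteration depth (Sec. 7)
        pool += propose_candidates(pool)                    # candidate generation (Sec. 5)
        pool = sorted(pool, key=score, reverse=True)[:top_k]  # filter and retain (Sec. 6)
    return max(pool, key=score)                             # approximate rho_opt
```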
3 Initialize Seed Prompts

3.1 Manual Instructions

Several approaches use a seed set of manually created instructions that offer interpretable and strong baselines as the basis for further improvement, inter alia ProTeGi (Pryzant et al., 2023), GPS (Xu et al., 2022), and SPRIG (Zhang et al., 2024b). While obtaining quality examples can be costly, APE (Zhou et al., 2022)¹ showed that a few hundred samples are sufficient for further optimization.

¹ Note: APE stands for the Automatic Prompt Engineer method introduced by Zhou et al. (2022), not to be confused with APO.
[Figure 1: Taxonomy of APO techniques - Prompt optimization anatomy (§2); Seed Prompts (§3): Manual Instructions (§3.1), Instruction-induction via LLMs (§3.2); Inference evaluation and feedback (§4): LLM Feedback (§4.2), Improving single candidate (§4.2.1), Improving multiple candidates (§4.2.2).]
Table 1: Comparison of some APO techniques under our framework (Tables 2,3,4 show full comparison)
4.2 LLM Feedback

LLM evaluators can provide multiple feedback data; these approaches broadly fall into two categories - improving a single prompt candidate versus improving multiple prompt candidates (discussed below; examples in Appendix 14.3).

4.2.1 Improving Single Candidate

SCULPT (Kumar et al., 2024) introduces a systematic method for tuning long, unstructured prompts by employing a hierarchical tree structure and two-step feedback loops - a preliminary assessment and an error assessment - to evaluate and correct prompts before and after execution. The feedback updates the hierarchical prompt tree, which is then back-synthesized into a new prompt candidate. PACE (Dong et al., 2024b) applies an actor-critic editing framework to the prompt refinement process itself, allowing for more dynamic and adaptive adjustments. Overcoming the limitations of optimizing a single metric, CRISPO (He et al., 2025) adopts a multi-aspect critique-suggestion meta-prompt to highlight flaws in the generated response across multiple dimensions such as style, precision, and content alignment; it then leverages this detailed, aspect-specific feedback to iteratively update the prompts. AutoHint (Sun et al., 2023) summarizes feedback for multiple incorrect inferences via hints to instill improvements into a single prompt candidate.

4.2.2 Improving Multiple Candidates

ProTeGi (Pryzant et al., 2023) and TextGrad (Yuksekgonul et al., 2024) leverage textual "gradients" to guide the discrete prompt optimization procedure, much like the gradient-descent style of continuous prompt optimization approaches. Unlike continuous gradient descent, ProTeGi samples multiple "gradients", i.e., directions of improvement, and each such "gradient" is used to generate several prompt candidates for evaluation in the next iteration. PromptAgent (Wang et al., 2024a) similarly used an error-collection approach to emulate expert-written prompts consisting of clear sections like "Task description", "Domain Knowledge", "Solution Guidance", "Exception Handling", and "Output Formatting". PREFER (Zhang et al., 2024a) utilizes a feedback-reflect-refine cycle to aggregate feedback into multiple prompts in an ensemble, improving the model's ability to generalize across various tasks. Survival of the Safest (SOS) (Sinha et al., 2024) added a safety score to a multi-objective prompt optimization framework that uses an interleaved strategy to balance performance and security in LLMs simultaneously. To avoid accidentally damaging well-functioning prompts, StraGo (Wu et al., 2024) summarizes strategic guidance based on both correct and incorrect predictions as feedback.
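A minimal sketch of one such critique-then-rewrite step, loosely following the ProTeGi recipe, is shown below; `llm` is a hypothetical text-completion callable, and the meta-prompt wording here is paraphrased, not quoted from the paper (the actual meta-prompts appear in Appendix 14.3).

```python
def textual_gradient_step(prompt, errors, llm, num_feedbacks=3, num_candidates=2):
    """One critique-then-rewrite step in the spirit of ProTeGi.

    `llm` is an assumed str -> str callable (any chat/completion API);
    `errors` is a list of (input, gold, prediction) failure cases.
    """
    error_str = "\n".join(f"Input: {x}\nGold: {y}\nPred: {p}" for x, y, p in errors)
    # 1) The "gradient": natural-language reasons the prompt failed.
    critique = llm(
        f'My current prompt is:\n"{prompt}"\n'
        f"It gets these examples wrong:\n{error_str}\n"
        f"Give {num_feedbacks} reasons why the prompt could be at fault."
    )
    # 2) Apply the gradient: rewrite the prompt in the direction of the feedback.
    return [
        llm(f'Current prompt: "{prompt}"\nFeedback: {critique}\n'
            f"Write an improved prompt that addresses the feedback.")
        for _ in range(num_candidates)
    ]
```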
4.3 Human-feedback

A few works also incorporate human feedback, either at compile-time or at inference-time, into the prompt construction / optimization process. Joko et al. (2024) proposed "Generative Active Task Elicitation" to better capture human preferences: it prompts a language model to interactively ask questions and infer human preferences conditioned on the history of free-form interaction. Cheng et al. (2024) trained a smaller LLM to optimize input prompts based on user preference feedback, achieving up to a 22% increase in win rates for ChatGPT and 10% for GPT-4. PROMST (Chen et al., 2024) tackles the challenges of multi-step tasks by incorporating human-designed feedback rules and a learned heuristic model. APOHF (Lin et al., 2024) focuses on optimizing prompts using only human preference feedback rather than numeric scores, employing a dueling-bandits-inspired strategy to efficiently select prompt pairs for preference feedback, proving effective for tasks like text-to-image generation and response optimization.
5 Candidate Prompt Generation

In this step, one or more candidate prompts are generated that are most likely to result in an improvement in a metric of interest f ∈ F. The approaches reviewed below range from simple rule-based edits (Sec. 5.1) to sophisticated agentic systems that combine LLM-based evaluations (Sec. 4.2) with various filtering strategies (Sec. 6).
5.1 Heuristic-based Edits

Several works propose heuristic-based mechanisms that edit intermediate prompt candidates to generate newer candidates. These range from word-, phrase-, or sentence-level edits (either simple rule-based or LLM-generated) to metric-driven incremental search. While these strategies may not find the most optimal solution, they help make the discrete prompt optimization problem computationally tractable.

5.1.1 Monte Carlo Sampling

ProTeGi (Pryzant et al., 2023) uses Monte Carlo sampling to explore the combinatorial discrete solution space in an incremental fashion - it samples multiple textual gradients to generate prospective candidates, and spawns paraphrases as Monte Carlo successors for evaluation. PromptAgent (Wang et al., 2024a) uses a tree variant called Monte Carlo Tree Search (MCTS), which consists of 4 steps — Selection, Expansion, Simulation, and Backpropagation (also explained in Sec. 6).
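A minimal sketch of the paraphrase-based successor expansion follows, assuming a generic `llm` completion callable; the sampling prompt is illustrative, not ProTeGi's exact wording.

```python
def monte_carlo_successors(prompt, llm, num_successors=4):
    """Spawn paraphrases of a candidate prompt as Monte Carlo successors,
    so the discrete neighborhood around it gets explored.
    `llm` is an assumed str -> str completion callable."""
    return [
        llm("Generate a variation of the following instruction while keeping "
            f"its meaning:\n{prompt}")
        for _ in range(num_successors)
    ]
```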
5.1.2 Genetic Algorithm

A significant line of work applies well-studied genetic algorithms to make discrete edits to texts. The common recipe is to 1/ mutate and 2/ cross over components from promising candidates.

Token mutations: SPRIG (Zhang et al., 2024b) and CLAPS perform token-level mutations. SPRIG uses a starting corpus of 300 components grouped into categories like CoT, roles, styles, emotions, scenarios, and good properties. It performs add/rephrase/swap/delete edits, highlighting the complementary strengths of optimizing system prompts alongside task prompts (via methods like ProTeGi) to enhance accuracy across multiple diverse domains, languages, and tasks without needing repeated task-specific optimizations.

LLM-based mutation: LMEA (Liu et al., 2023), SOS (Sinha et al., 2024), and StraGo (Wu et al., 2024) use mutation prompts with LLMs to overcome the traditional complexity of designing tailored operators for cross-over / mutation. PromptBreeder (Fernando et al., 2023) advocates self-referential improvement of all prompts in the prompt optimization system - Direct Mutation of task prompts, Hypermutation of the mutation prompts themselves, Lamarckian Mutation where prompts are reverse-engineered from successful examples (similar to Instruction Induction, Honovich et al. (2023)), and finally Crossover and Shuffling to improve the diversity of the prompt pool. EvoPrompt (Guo et al., 2024) uses Differential Evolution, where differences between existing prompts are incorporated to form new prompt candidates to overcome the problem of local optima. AELP (Hsieh et al., 2024) also uses mutation operators to perform sentence-level edits in an iterative fashion; it includes sentence-level reward histories {(s_{t−1}, s_t, r_t)} in the mutation prompt in order to avoid local optima and accidentally returning to sub-optimal versions. GPS (Xu et al., 2022) used back-translation, sentence continuation, and cloze transformations to perform prompt mutation. PromptWizard (Agarwal et al., 2024) proposed a pipeline combining several steps, including iterative improvement, few-shot example synthesis and selection, utilizing the LLM's reasoning capability to improve and validate the prompt, and finally an expert persona to ensure consistency in the style of generated prompts.
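The sketch below illustrates the shared mutate-and-crossover recipe with LLM-driven operators in the spirit of LMEA/EvoPrompt; `llm` and `score` are assumed callables, and the operator prompts are illustrative rather than taken from any one paper.

```python
import random

def evolve_prompts(population, llm, score, generations=5, pop_size=8):
    """Evolutionary loop where an LLM replaces hand-crafted string operators.

    population -- list of prompt strings (needs at least two members)
    llm        -- assumed str -> str completion callable
    score      -- assumed str -> float fitness function (e.g., dev accuracy)
    """
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            p1, p2 = random.sample(population, 2)                 # parent selection
            child = llm("Combine the best parts of these two instructions "
                        f"into one:\n1. {p1}\n2. {p2}")           # crossover
            children.append(llm("Mutate this instruction slightly while "
                                f"preserving the task:\n{child}"))  # mutation
        # Survival of the fittest: retain the top pop_size candidates.
        population = sorted(population + children, key=score,
                            reverse=True)[:pop_size]
    return population
```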
5.1.3 Word / Phrase Level Edits

Several word-edit approaches first identify "influential" tokens in the prompts. COPLE (Zhan et al., 2024) argued that LLMs exhibit lexical sensitivity, showing that merely replacing a few words with their synonyms can yield significant improvements. First, "influential" tokens are identified as those for which the expected dev-set loss E_{D_val}[L(y, ŷ)] drops the most after removing the token, versus the original prompt; the influential tokens are then replaced using predictions from a masked language model. This token-replacement approach is also attractive as a standalone post-processing step for long prompts that have already been optimized using other LLM-based approaches. GRIPS (Prasad et al., 2023) argues that phrase-level editing is an effective and interpretable method to optimize prompts, leveraging 4 basic edit operations - add, delete, paraphrase, and swap.
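A sketch of COPLE-style influence scoring follows, under the assumption of a `loss_on_dev` callable that computes the expected dev-set loss for a tokenized prompt; the subsequent MLM replacement step is omitted.

```python
def influential_tokens(prompt_tokens, loss_on_dev, top_n=5):
    """Rank tokens by how much removing them changes the dev-set loss.

    prompt_tokens -- list of tokens forming the current prompt
    loss_on_dev   -- assumed callable list[str] -> float,
                     implementing E_{D_val}[L(y, y_hat)]
    """
    base = loss_on_dev(prompt_tokens)
    influence = []
    for i in range(len(prompt_tokens)):
        ablated = prompt_tokens[:i] + prompt_tokens[i + 1:]   # drop token i
        influence.append((abs(loss_on_dev(ablated) - base), i))
    # The most influential positions become candidates for MLM-based replacement.
    return [i for _, i in sorted(influence, reverse=True)[:top_n]]
```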
5.1.4 Vocabulary Pruning

Some works prune the vocabulary space V to V_pruned for decoding the next token of the optimized prompt ρ*. CLAPS (Zhou et al., 2023) argued that general search spaces are highly redundant, and uses K-means clustering to find word clusters and retain the top 2,000 words closest to the cluster centroids. BDPL (Diao et al., 2022) used pointwise mutual information (PMI) to retain the top co-occurring n-grams for decoding. PIN (Choi et al., 2024) instead adds regularization in the form of Tsallis entropy (well-suited to heavy-tailed distributions like natural language) to the RL training of a prompt-generation network, reducing the probability mass of unlikely tokens and improving interpretability.
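A sketch of CLAPS-style vocabulary pruning using scikit-learn's KMeans is shown below; the cluster count and the source of the word embeddings are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_vocabulary(vocab, embeddings, num_clusters=50, keep=2000):
    """Shrink the decoding search space by keeping only words close to
    cluster centroids, in the spirit of CLAPS.

    vocab      -- list of words
    embeddings -- aligned (|V|, d) array of word embeddings (assumed given)
    """
    km = KMeans(n_clusters=num_clusters, n_init=10).fit(embeddings)
    centroids = km.cluster_centers_[km.labels_]            # centroid assigned to each word
    dist = np.linalg.norm(embeddings - centroids, axis=1)  # word-to-centroid distance
    keep_idx = np.argsort(dist)[:keep]                     # the `keep` closest words
    return [vocab[i] for i in keep_idx]
```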
5.2 Editing via Auxiliary Trained NN

Some approaches leverage a trained auxiliary neural network to edit the initial prompt to obtain the desired improvements. We include here approaches where the finetuned network is different from, and smaller than, the task network.

5.2.1 Reinforcement Learning

Multi-objective optimization techniques (Jafari et al., 2024) demonstrate superiority over simple reward averaging, particularly through volume-based methods that effectively balance competing objectives. Dynamic prompt-modification strategies, introduced through prompt rewriting (Kong et al., 2024), directional stimulus prompting (Li et al., 2023d), and test-time editing (Zhang et al., 2022), address the important goal of moving beyond static prompt generation. Prompt-OIRL (Sun et al., 2024a) also tackled the test-time optimization objective by learning an offline reward model and subsequently using a best-of-N strategy to recommend the optimal prompt in a query-dependent fashion. BDPL (Diao et al., 2022) optimized discrete prompts using a variance-reduced policy gradient algorithm to estimate gradients, allowing user devices to fine-tune tasks with limited API calls.

5.2.2 Finetuning LLMs

BPO (Cheng et al., 2024) trains a smaller 7B model to align itself to task performance on individual LLMs using reward-free alignment. FIPO (Lu et al., 2025) trains a local model (7B-13B) to perform prompt optimization, preserving privacy and adapting better to target models by leveraging both data diversification and strategic fine-tuning such as SFT, preference optimization, and iterative preference learning.

5.2.3 Generative Adversarial Networks

Long et al. (2024) framed the prompt optimization process in the GAN setting. An LLM generator takes a question and the generation prompt to produce an output, and the resulting (input, output) pairs are evaluated by an LLM-powered discriminator whose goal is to distinguish generated pairs from ground-truth pairs. The generator and discriminator are jointly optimized under an adversarial loss, with a prompt-modifier LLM rewriting their prompts.
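A rough sketch of one adversarial round under this framing follows, with a single assumed `llm` callable standing in for the generator, discriminator, and prompt modifier; the prompts and the fooling-rate bookkeeping are illustrative, not the paper's implementation.

```python
def adversarial_round(gen_prompt, questions, truths, llm):
    """One generator/discriminator round in the spirit of Long et al. (2024).

    gen_prompt -- current generation prompt being optimized
    questions  -- task inputs; truths -- aligned ground-truth answers
    llm        -- assumed str -> str callable used in all three roles
    """
    outputs = [llm(f"{gen_prompt}\nQuestion: {q}") for q in questions]
    fooled = 0
    for q, out, gold in zip(questions, outputs, truths):
        # Discriminator: try to tell the generated answer from the ground truth.
        verdict = llm(f"One of these answers to '{q}' is model-generated.\n"
                      f"A: {out}\nB: {gold}\nWhich one is generated? Answer A or B.")
        fooled += verdict.strip().upper().startswith("B")   # discriminator was fooled
    # Prompt-modifier LLM rewrites the generator prompt against the discriminator.
    new_prompt = llm("Rewrite this prompt so that its answers are harder to "
                     f"tell apart from human ground truth:\n{gen_prompt}")
    return new_prompt, fooled / max(len(questions), 1)
```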
5.3 Metaprompt Design

PE2 (Ye et al., 2024) argued that previous works under-explored the meta-prompt search space. OPRO (Yang et al., 2024a) proposes a meta-prompt design (see Appendix 14.2) that includes the optimization problem description in natural language and previously generated solutions (multiple solutions per stage, for diversity) with their scores, alongside the meta-instruction for prompt refinement. DAPO (Yang et al., 2024c) utilizes a well-designed meta-instruction to guide the LLM in generating high-quality, structured initial prompts (containing task-specific information, e.g., task type and description, output format and constraints, reasoning process, and professional tips) by observing given input-output exemplars. DAPO then iteratively optimizes the prompts at the sentence level, leveraging previous tuning experience to expand prompt candidates.
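A sketch of assembling such a meta-prompt from an optimization trajectory is given below; the field wording is illustrative rather than OPRO's exact template (the actual template appears in Appendix 14.2).

```python
def build_opro_metaprompt(task_description, scored_prompts, exemplars):
    """Assemble an OPRO-style meta-prompt: a natural-language problem
    description, previously generated prompts with their scores, and the
    meta-instruction asking for a new, higher-scoring instruction.

    scored_prompts -- list of (prompt, score) pairs from earlier iterations
    exemplars      -- list of (input, output) task examples
    """
    history = "\n".join(
        f"text: {p}\nscore: {s:.1f}"
        for p, s in sorted(scored_prompts, key=lambda ps: ps[1])  # worst -> best
    )
    examples = "\n".join(f"input: {x} | output: {y}" for x, y in exemplars)
    return ("Your task: write an instruction for the following problem.\n"
            f"Problem: {task_description}\n"
            "Previous instructions and their scores (higher is better):\n"
            f"{history}\nExamples:\n{examples}\n"
            "Write a new instruction, different from all of the above, "
            "that achieves a higher score.")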
5.4 Coverage-based

Some approaches seek to "cover" the entire problem space - either within a single prompt, or using multiple prompts working individually or as an ensemble during inference.

5.4.1 Single Prompt-expansion

AMPO (Yang et al., 2024d) uses LLM feedback to enumerate all the failure cases on the evaluation set D_val and then enlists each of them in the meta-instruction in an if-then-else format, using 3 modules - 1/ Pattern Recognition, 2/ Branch Adjustment, and 3/ Branch Pruning - to decide whether to enhance existing branches or to grow new ones. Similarly, UNIPROMPT focuses on explicitly ensuring that the various semantic facets of a task get represented in the final prompt. It designs a human-like (manual) prompt engineering approach (UniPrompt) with two stages: a) task-facet initialization using background knowledge, and b) refinement using examples.

5.4.2 Mixture of Experts

Wang et al. (2025) introduced Mixture-of-Expert-Prompts (MOP), where each expert is a task prompt to be used for specialized inference. MOP first clusters all demonstrations using K-means clustering. Then, the Region-Based Joint Search (RBJS) algorithm (Sec. 6.3) generates the appropriate instruction for each exemplar cluster via instruction induction (Sec. 3.2), based on a mix of in-cluster and out-of-cluster demonstrations to cover "blind spots". During inference, the single expert prompt whose cluster centroid µ_c is closest to the instance embedding is invoked: arg min_c ||ϕ(x_i) − µ_c||_2.
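The routing rule reduces to a nearest-centroid lookup, sketched below with an assumed `embed` function playing the role of ϕ and precomputed cluster centroids.

```python
import numpy as np

def route_to_expert(x, embed, expert_prompts, centroids):
    """MOP-style inference routing: pick the expert prompt whose cluster
    centroid is nearest to the instance embedding,
    i.e., argmin_c ||phi(x) - mu_c||_2.

    embed    -- assumed callable x -> np.ndarray of shape (d,)
    centroids-- np.ndarray of shape (C, d), one row per expert cluster
    """
    phi_x = embed(x)
    dists = np.linalg.norm(centroids - phi_x, axis=1)  # distance to each mu_c
    return expert_prompts[int(np.argmin(dists))]
```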
5.4.3 Ensemble Methods

PromptBoosting (Hou et al., 2023), Boosted Prompting (Pitis et al., 2023), PREFER (Zhang et al., 2024a), etc. are ensemble methods that invoke multiple prompts during inference and combine their outputs to generate the final output ŷ = y_0 + Σ_{i=1}^{m} β_i y_i. GPO (Li et al., 2023c) also uses labeled source data to generate an ensemble of prompts, which are applied to unlabeled target data to generate the output through majority voting.
of bandit search - identifying the most suitable
5.5 Program Synthesis arm (prompt candidate) operating on a fixed com-
Program-synthesis based approaches transform putation budget. They use the Upper Confidence
LLM pipelines into structured, modular compo- Bounds (UCB, Algorithm 2) which balances explo-
nents that can be systematically optimized and ration with exploitation. In each iteration of prompt
composed. These optimization techniques itera- optimization, they sample a different evaluation
tively refine instructions and demonstrations for dataset Dsample ∈ Dval , and maintain a moving
each module to improve the entire pipeline’s per- estimate of the optimality of each arm (i.e. prompt).
formance, DSP (Khattab et al., 2022) introduces In each iteration, the playout filters top-B prompt
a three-stage framework for retrieval-augmented candidates with the greatest score for further ex-
inference: Demonstrate (generates task-specific ploration. PromptAgent uses a variation of UCB
demonstrations), Search (retrieves relevant infor- called UCB for Trees (UCT) which are used in the
mation), and Predict (combines retrieved info with setting of contextual bandits (i.e. the action-space
demonstrations). DSPY (Khattab et al., 2024) and the reward function is state-dependent). AELP
transforms LLM pipelines into text transformation (Hsieh et al., 2024) used a modification called Lin-
graphs - introducing parameterized models, learn- ear UCB (Li et al., 2010) which uses a closed form
ing through demonstrations, and a compiler that op- linear estimate based on the reward trajectories of
timizes pipelines. DLN (Sordoni et al., 2023) simi- previously sampled edits as well as prompt embed-
larly considers chained LLM calls as stacked deep ding ϕ(s) to select the next best-arm.
language networks performing variational infer-
ence, where the learnable parameters for each layer
are task-decomposed prompt templates. MIPRO
6.2 Upper Confidence Bound and Variants

Relying on a single static evaluation dataset can bias the selection procedure and ultimately yield suboptimal solutions. ProTeGi, SPRIG, inter alia, cast the candidate-prompt selection problem as a bandit search - identifying the most suitable arm (prompt candidate) while operating on a fixed computation budget. They use the Upper Confidence Bound (UCB, Algorithm 2), which balances exploration with exploitation. In each iteration of prompt optimization, they sample a different evaluation dataset D_sample ⊂ D_val and maintain a moving estimate of the optimality of each arm (i.e., prompt); the playout then filters the top-B prompt candidates with the greatest scores for further exploration. PromptAgent uses a variation of UCB called UCB for Trees (UCT), which is used in the setting of contextual bandits (i.e., the action space and the reward function are state-dependent). AELP (Hsieh et al., 2024) used a modification called Linear UCB (Li et al., 2010), which uses a closed-form linear estimate based on the reward trajectories of previously sampled edits, as well as the prompt embedding ϕ(s), to select the next best arm.
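A sketch of the UCB arm-selection rule with the standard exploration bonus is shown below; the constant `c` and the top-B filtering are illustrative parameters, not the exact form in Algorithm 2.

```python
import math

def ucb_select(arms, counts, mean_rewards, total_plays, c=1.4, top_b=4):
    """Score each prompt-arm by mean reward plus an exploration bonus and
    keep the top-B arms. Unplayed arms get an infinite bonus so that every
    prompt is evaluated at least once.

    counts, mean_rewards -- per-arm play counts and moving mean rewards
    """
    def ucb(i):
        if counts[i] == 0:
            return float("inf")
        return mean_rewards[i] + c * math.sqrt(math.log(total_plays) / counts[i])

    ranked = sorted(range(len(arms)), key=ucb, reverse=True)
    return [arms[i] for i in ranked[:top_b]]
```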
6.3 Region-based Joint Search

MOP (Wang et al., 2025) proposes a Mixture-of-Expert-Prompts, performing prompt optimization for each expert individually. Once C exemplar clusters are identified, the RBJS search first samples examples D_exemplars ⊂ D_C ∪ (D \ D_C), and then uses APE to induce and optimize each expert instruction.

6.4 Metaheuristic Ensemble

The PLUM (Pan et al., 2024) library offers a metaheuristic ensemble of different search algorithms such as hill climbing, simulated annealing, genetic algorithms, tabu search, and harmony search.
7 Iteration Depth

7.1 Fixed Steps

Most approaches choose to carry out the prompt optimization for a fixed number of steps N.

7.2 Variable Number of Steps

GRIPS (Prasad et al., 2023) concludes the search when the number of successive iterations with negative gains breaches a patience parameter, whereas PromptAgent concludes APO when r_t ≤ ϵ_min ∨ r_t ≥ ϵ_max.
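A sketch of the patience-based stopping rule follows; `step` is an assumed callable that performs one optimization iteration and returns a candidate together with its score.

```python
def optimize_with_patience(step, initial_prompt, patience=3, max_iters=50):
    """GRIPS-style variable-depth loop: stop once `patience` successive
    iterations fail to improve the best score.

    step -- assumed callable prompt -> (new_prompt, score)
    """
    best_prompt, best_score, stale = initial_prompt, float("-inf"), 0
    for _ in range(max_iters):
        candidate, score = step(best_prompt)
        if score > best_score:
            best_prompt, best_score, stale = candidate, score, 0
        else:
            stale += 1                      # a negative-gain iteration
            if stale >= patience:
                break                       # patience parameter breached
    return best_prompt
```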
8 Theoretical Perspectives

8.1 Upper Bound of Improvement from APO

AlignPro (Trivedi et al., 2025) establishes an upper bound on the gains realizable from discrete prompt optimization under a given prompt optimizer, as well as a suboptimality gap w.r.t. the RLHF-optimal policy π*, while a lower bound is left unexplored.

8.2 Other Related Perspectives

Bhargava et al. (2024) proposed a control-theoretic framework to establish bounds on the set of reachable LLM outputs for self-attention in terms of the singular values of its weight matrices. Liu et al. (2024c) showed the existence of a strong transformer that can approximate any sequence-to-sequence Lipschitz function; they also showed the existence of "difficult" datasets that depth-limited transformers could not commit to memory.

9 Challenges and Future Directions

… settings. Barring a few tasks covered by Joko et al. (2024); Sun et al. (2024a); Zhang et al. (2022); Choi et al. (2024), inference-time optimization of multiple unknown tasks is underexplored. More robust evaluations are needed for task-agnostic APO systems combining seen and unseen tasks.

9.2 Unclear Mechanisms

Melamed et al. (2024) showed that prompts have so-called 'evil twins' that are uninterpretable yet recover some of the performance of gold-standard prompts. Lu et al. (2024) showed that rare gibberish strings can serve as competitive delimiters τ in prompts. Yang et al. (2024b) showed that self-reflection by LLMs can suffer from incorrect error identification, prior biases, and semantic invalidity, leading to failures in yielding improved prompts. More studies are needed to better uncover the mechanisms of prompt optimization.

9.3 APO for System Prompts / Agents

Although SPRIG explored optimizing system prompts in chat-style settings, scalability remains a challenge - optimizing system prompts required a predefined corpus and close to 60 hours, whereas ProTeGi only needed ~10 minutes per task. Similarly, optimizing prompts for the several components of an agentic system in a concurrent fashion poses an exciting direction for future research.

9.4 Multimodal APO

Recently, textual prompt optimization has expanded to multimodal domains: text-to-image (Liu et al., 2024b; Mañas et al., 2024; Liu et al., 2024d), text-to-video (Ji et al., 2024), text-to-audio (Huang et al., 2023), and text-image alignment models like CLIP (Du et al., 2024; Mirza et al., 2024). Beyond textual prompts, Huang et al. (2023) explore optimizing multimodal inputs, such as images, to elicit better responses from large multimodal models. However, the interplay between modalities in prompt optimization remains underexplored. Future research could develop APO frameworks to jointly optimize multimodal prompts (e.g., removing background noise from audio, adding visual markers to videos) to fully leverage their synergies.
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577.

Hideaki Joko, Shubham Chatterjee, Andrew Ramsay, Arjen P De Vries, Jeff Dalton, and Faegheh Hasibi. 2024. Doing personal laps: Llm-augmented dialogue construction for personalized multi-session conversational search. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 796–806.

Gurusha Juneja, Nagarajan Natarajan, Hua Li, Jian Jiao, and Amit Sharma. 2024. Task facet learning: A structured approach to prompt optimization. arXiv preprint arXiv:2406.10504.

David Jurgens, Srijan Kumar, Raine Hoover, Daniel A. McFarland, and Dan Jurafsky. 2018. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391–406.

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. ArXiv, abs/1906.00300.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Hector J. Levesque, Ernest Davis, and L. Morgenstern. 2011. The winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

Bei Li, Rui Wang, Junliang Guo, Kaitao Song, Xu Tan, Hany Hassan, Arul Menezes, Tong Xiao, Jiang Bian, and JingBo Zhu. 2023a. Deliberate then generate: Enhanced prompting framework for text generation.

Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, and Xing Xie. 2023b. Large language models understand and can be enhanced by emotional stimuli. arXiv preprint arXiv:2307.11760.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 661–670, New York, NY, USA. Association for Computing Machinery.

Moxin Li, Wenjie Wang, Fuli Feng, Yixin Cao, Jizhi Zhang, and Tat-Seng Chua. 2023c. Robust prompt optimization for large language models against distribution shifts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1539–1554, Singapore. Association for Computational Linguistics.

Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. 2023d. Guiding large language models via directional stimulus prompting. arXiv preprint arXiv:2302.11520.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.

Xiaoqiang Lin, Zhongxiang Dai, Arun Verma, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. 2024. Prompt optimization with human feedback.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Annual Meeting of the Association for Computational Linguistics.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024a. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.

Shengcai Liu, Caishun Chen, Xinghua Qu, Ke Tang, and Yew Soon Ong. 2023. Large language models as evolutionary optimizers. 2024 IEEE Congress on Evolutionary Computation (CEC), pages 1–8.

Shihong Liu, Samuel Yu, Zhiqiu Lin, Deepak Pathak, and Deva Ramanan. 2024b. Language models as black-box optimizers for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12687–12697.

Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. 2024c. Automatic and universal prompt injection attacks against large language models.

Yilun Liu, Minggui He, Feiyu Yao, Yuhe Ji, Shimin Tao, Jingzhou Du, Duan Li, Jian Gao, Li Zhang, Hao Yang, et al. 2024d. What do you want? user-centric prompt generation for text-to-image synthesis via multi-turn guidance. arXiv preprint arXiv:2408.12910.

Xuan Do Long, Yiran Zhao, Hannah Brown, Yuxi Xie, James Xu Zhao, Nancy F. Chen, Kenji Kawaguchi, Michael Shieh, and Junxian He. 2024. Prompt optimization via adversarial in-context learning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7308–7327, Bangkok, Thailand. Association for Computational Linguistics.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL Conference.

Junru Lu, Siyu An, Min Zhang, Yulan He, Di Yin, and Xing Sun. 2025. FIPO: Free-form instruction-oriented prompt optimization with preference dataset and modular fine-tuning schema. In Proceedings of the 31st International Conference on Computational Linguistics, pages 11029–11047, Abu Dhabi, UAE. Association for Computational Linguistics.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Annual Meeting of the Association for Computational Linguistics.

Yao Lu, Jiayi Wang, Raphael Tang, Sebastian Riedel, and Pontus Stenetorp. 2024. Strings from the library of babel: Random sampling as a strong baseline for prompt optimisation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2221–2231, Mexico City, Mexico. Association for Computational Linguistics.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. ArXiv, abs/1808.09602.

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150.

Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, and Michal Drozdzal. 2024. Improving text-to-image consistency via automatic prompt optimization. arXiv preprint arXiv:2403.17804.
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Rimon Melamed, Lucas H. McCabe, Tanay Wakhare, Yejin Kim, H. Howie Huang, and Enric Boix-Adsera. 2024. Prompts have evil twins.

M Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, et al. 2024. Glov: Guided large language models as implicit optimizers for vision language models. arXiv preprint arXiv:2410.06154.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. In Annual Meeting of the Association for Computational Linguistics.

Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. 2020. Ethos: an online hate speech detection dataset. arXiv preprint arXiv:2006.08328.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Conference on Computational Natural Language Learning.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. ArXiv, abs/1808.08745.

Ehsan Nezhadarya, Yang Liu, and Bingbing Liu. 2019. Boxnet: A deep learning method for 2d bounding box estimation from bird's-eye view point cloud. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 1557–1564. IEEE.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial nli: A new benchmark for natural language understanding. ArXiv, abs/1910.14599.

Jekaterina Novikova, Ondrej Dusek, and Verena Rieser. 2017. The e2e dataset: New challenges for end-to-end generation. ArXiv, abs/1706.09254.

Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. 2024. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, Miami, Florida, USA. Association for Computational Linguistics.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR.

Rui Pan, Shuo Xing, Shizhe Diao, Wenhe Sun, Xiang Liu, KaShun Shum, Jipeng Zhang, Renjie Pi, and Tong Zhang. 2024. Plum: Prompt learning using metaheuristics. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2177–2197, Bangkok, Thailand. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. ArXiv, cs.CL/0409058.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Annual Meeting of the Association for Computational Linguistics.

Arkil Patel, S. Bhattamishra, and Navin Goyal. 2021. Are nlp models really able to solve simple math word problems? In North American Chapter of the Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273.

Silviu Pitis, Michael R Zhang, Andrew Wang, and Jimmy Ba. 2023. Boosted prompt ensembles for large language models. arXiv preprint arXiv:2304.05970.

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. Xcopa: A multilingual dataset for causal commonsense reasoning. arXiv preprint arXiv:2005.00333.

Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. 2023. Grips: Gradient-free, edit-based instruction search for prompting large language models.

Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic prompt optimization with "gradient descent" and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957–7968, Singapore. Association for Computational Linguistics.

Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? ArXiv, abs/1804.06323.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Conference on Empirical Methods in Natural Language Processing.

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. Gpqa: A graduate-level google-proof q&a benchmark. ArXiv, abs/2311.12022.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI spring symposium series.

Subhro Roy and Dan Roth. 2016. Solving general arithmetic word problems. ArXiv, abs/1608.01413.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473.

Tobias Schnabel and Jennifer Neville. 2024. Symbolic prompt program search: A structure-aware approach to efficient compile-time prompt optimization.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.

Jingyuan Selena She, Christopher Potts, Sam Bowman, and Atticus Geiger. 2023. Scone: Benchmarking negation reasoning in language models with fine-tuning and in-context learning. In Annual Meeting of the Association for Computational Linguistics.

Zeru Shi, Zhenting Wang, Yongye Su, Weidi Luo, Fan Yang, and Yongfeng Zhang. 2024. Robustness-aware automatic prompt optimization.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.

Ankita Sinha, Wendi Cui, Kamalika Das, and Jiaxin Zhang. 2024. Survival of the safest: Towards secure prompt optimization through interleaved multi-objective evolution. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1016–1027, Miami, Florida, US. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.

Gizem Sogancioglu, Hakime Öztürk, and Arzucan Özgür. 2017. Biosses: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics, 33:i49–i58.

Alessandro Sordoni, Eric Yuan, Marc-Alexandre Côté, Matheus Pereira, Adam Trischler, Ziang Xiao, Arian Hosseini, Friederike Niedtner, and Nicolas Le Roux. 2023. Joint prompt optimization of stacked llms using variational inference. In Advances in Neural Information Processing Systems, volume 36, pages 58128–58151. Curran Associates, Inc.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Hao Sun, Alihan Hüyük, and Mihaela van der Schaar. 2024a. Query-dependent prompt evaluation and optimization with offline inverse RL. In The Twelfth International Conference on Learning Representations.

Hong Sun, Xue Li, Yinchuan Xu, Youkow Homma, Qi Cao, Min Wu, Jian Jiao, and Denis Charles. 2023. Autohint: Automatic prompt optimization with hint generation. arXiv preprint arXiv:2307.07415.

Jingwei Sun, Ziyue Xu, Hongxu Yin, Dong Yang, Daguang Xu, Yudong Liu, Zhixu Du, Yiran Chen, and Holger R. Roth. 2024b. Fedbpt: efficient federated black-box prompt tuning for large language models. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. 2023. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. ArXiv, abs/1811.00937.

Prashant Trivedi, Souradip Chakraborty, Avinash Reddy, Vaneet Aggarwal, Amrit Singh Bedi, and George K. Atia. 2025. Align-pro: A principled approach to prompt optimization for llm alignment.

Nirali Vaghani and Mansi Thummar. 2023. Flipkart product reviews with sentiment dataset.

Ellen M Voorhees and Dawn M Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 200–207.

Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan O. Arik. 2024. Teach better or show smarter? on instructions and exemplars in automatic prompt optimization.

Ruochen Wang, Sohyun An, Minhao Cheng, Tianyi Zhou, Sung Ju Hwang, and Cho-Jui Hsieh. 2025. One prompt is not enough: automated construction of a mixture-of-expert prompts. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. 2022a. Scienceworld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298.

William Yang Wang. 2017. "liar, liar pants on fire": A new benchmark dataset for fake news detection. In Annual Meeting of the Association for Computational Linguistics.

Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. Xing, and Zhiting Hu. 2024a. Promptagent: Strategic planning with language models enables expert-level prompt optimization. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022b. Self-instruct: Aligning language models with self-generated instructions. In Annual Meeting of the Association for Computational Linguistics.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Annual Meeting of the Association for Computational Linguistics.

Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur, and Na Cheng. 2024b. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39:165–210.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. In North American Chapter of the Association for Computational Linguistics.

Yurong Wu, Yan Gao, Bin Benjamin Zhu, Zineng Zhou, Xiaodi Sun, Sheng Yang, Jian-Guang Lou, Zhiming Ding, and Linjun Yang. 2024. StraGo: Harnessing strategic guidance for prompt optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10043–10061, Miami, Florida, USA. Association for Computational Linguistics.

Jasper Xian, Saron Samuel, Faraz Khoubsirat, Ronak Pradeep, Md Arafat Sultan, Radu Florian, Salim Roukos, Avirup Sil, Christopher Potts, and Omar Khattab. 2024. Prompts as auto-optimized training hyperparameters: Training best-in-class ir models from scratch with 10 gold labels. arXiv preprint arXiv:2406.11706.

Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Wang Yanggang, Haiyu Li, and Zhilin Yang. 2022. Gps: Genetic prompt search for efficient few-shot learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8162–8171.

Wei Xu, Alan Ritter, William B. Dolan, Ralph Grishman, and Colin Cherry. 2012. Paraphrasing for style. In International Conference on Computational Linguistics.

Weijia Xu, Andrzej Banburski-Fahey, and Nebojsa Jojic. 2024. Reprompting: automated chain-of-thought prompt inference through gibbs sampling. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2024a. Large language models as optimizers.

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. 2024b. Large language models as optimizers. In The Twelfth International Conference on Learning Representations.

Muchen Yang, Moxin Li, Yongle Li, Zijun Chen, Chongming Gao, Junqi Zhang, Yangyang Li, and Fuli Feng. 2024c. Dual-phase accelerated prompt optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12163–12173, Miami, Florida, USA. Association for Computational Linguistics.

Sheng Yang, Yurong Wu, Yan Gao, Zineng Zhou, Bin Benjamin Zhu, Xiaodi Sun, Jian-Guang Lou, Zhiming Ding, Anbang Hu, Yuan Fang, et al. 2024d. Ampo: Automatic multi-branched prompt optimization. arXiv preprint arXiv:2410.08696.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing.

Qinyuan Ye, Maxamed Axmed, Reid Pryzant, and Fereshte Khani. 2024. Prompt engineering a prompt engineer.

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. Textgrad: Automatic "differentiation" via text.

Lechen Zhang, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, and David Jurgens. 2024b. Sprig: Improving large language model performance by system prompt optimization. ArXiv, abs/2410.14826.

Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E. Gonzalez. 2022. Tempera: Test-time prompting via reinforcement learning.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Neural Information Processing Systems.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.

Han Zhou, Xingchen Wan, Ivan Vulić, and Anna Korhonen. 2023. Survival of the most influential prompts: Efficient black-box prompt search via clustering and pruning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13064–13077, Singapore. Association for Computational Linguistics.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers.
1. T = task type; I = task instruction; E = {(x_i, y_i)}_{i=1}^{e} = few-shot demonstrations in the prompt; τ = template delimiters; z = CoT recipe for a task instance, z_i ∈ I_i
2. M_task = target model; M_APO = APO system
3. ρ = concat([s_1, s_2, ..., s_m]) = concat(I, τ, E): a prompt composed of m sentences, comprising the instruction, template delimiters, and few-shot demonstrations
4. D = {(x_i, y_i)}_{i=1}^{m}: a collection of m input-output pairs. D_val is the validation set used to validate prompt performance; D_train is the training set used to finetune the language model (Reprompting)
5. {f_1, f_2, ...} ∈ F: metric functions with which to evaluate task-prompt performance
6. r : S × A → R: reward model score, where S is the state space and A is the action space
7. |V| = size of the vocabulary
8. ϕ : V* → R^d: embedding function that takes a sentence, generated as a finite sequence of tokens belonging to a vocabulary V, and produces a floating-point representation of dimension d
9. ρ* = argmax_{ρ ∈ V*} E_{D_val}[f_i(ρ)]: the best-performing prompt based on the metric score on the validation set
10. k = number of candidates for top-K search; B = beam width for beam search; N = number of search iterations
11. C = number of experts in a Mixture-of-Experts approach (MOP); µ_C = centroid of cluster C (MOP)
12. LLM_target = target model used for inference; LLM_rewriter = rewriter model; LLM_evaluator = evaluator model that provides LLM feedback on prompts, responses, or both
13. λ with subscripts denotes different latency types: λ_t = total training cost/latency, including all offline costs for data collection, preprocessing, and model fine-tuning; λ_i = per-example inference latency; λ_m = per-example MLM inference latency
1. Seed instructions
2. Inference evaluation
3. Candidate generation
4. Search+filter strategy
5. Iteration depth
6. Optimization time complexity
7. Prompt generation model
8. Target models
1. GPS (Xu et al., 2022): Seed instructions: manually created; Inference evaluation: task accuracy; Candidate generation: genetic algorithm (back-translation, cloze, sentence continuation); Search+filter: metaheuristic ensemble; Iteration depth: fixed; Time complexity: O(T ∗ N ∗ k ∗ λi); Target models: T0.
2. GRIPS (Prasad et al., 2023): Seed instructions: manually created; Inference evaluation: entropy-based score + task accuracy; Candidate generation: phrase-level add/remove/swap/paraphrase; Search+filter: TopK selection; Iteration depth: fixed; Time complexity: O(k ∗ N ∗ |Dval| ∗ B); Prompt generation model: PEGASUS paraphrase model; Target models: InstructGPT.
3. Instruction induction (Honovich et al., 2023): Seed instructions: instruction induction; Inference evaluation: accuracy + BERTScore; Candidate generation: LLM-rewriter; Iteration depth: fixed; Time complexity: O(|ρ| ∗ λi); Prompt generation model: InstructGPT, GPT-3; Target models: InstructGPT, GPT-3.
4. RLPrompt (Deng et al., 2022): Seed instructions: manually created; Inference evaluation: task accuracy + reward-model score; Candidate generation: RL-trained NN; Search+filter: TopK selection; Iteration depth: fixed; Time complexity: O(N ∗ ρ ∗ |V| ∗ λi); Prompt generation model: RoBERTa-large (reward model: DistilBERT); Target models: BERT, GPT-2.
5. TEMPERA (Zhang et al., 2022): Seed instructions: manually created; Inference evaluation: task accuracy; Candidate generation: RL-trained NN; Iteration depth: fixed; Time complexity: O(N ∗ k ∗ |V| ∗ C); Prompt generation model: RoBERTa-large; Target models: RoBERTa-large.
6. AELP (Hsieh et al., 2024): Seed instructions: manually created; Inference evaluation: task accuracy; Candidate generation: genetic algorithm (LLM-mutator); Search+filter: beam search; Iteration depth: fixed; Time complexity: O(N ∗ ρ ∗ k ∗ |D| ∗ λi); Prompt generation model: PaLM 2-L; Target models: PaLM text-bison.
7. APE (Zhou et al., 2022): Seed instructions: instruction induction; Inference evaluation: task accuracy; Candidate generation: no new candidates; Search+filter: TopK selection; Iteration depth: fixed; Time complexity: O(N ∗ k ∗ |Dval| ∗ λi); Prompt generation model: InstructGPT, GPT-3, T5, InsertGPT; Target models: InstructGPT, GPT-3.
8. AutoHint (Sun et al., 2023): Seed instructions: manually created; Inference evaluation: task accuracy + LLM-feedback; Candidate generation: LLM-rewriter; Search+filter: TopK selection; Iteration depth: fixed; Time complexity: O(T ∗ |D| ∗ λi); Prompt generation model: GPT-4.
9. BDPL (Diao et al., 2022): Seed instructions: manually created; Inference evaluation: task accuracy; Candidate generation: RL-trained NN; Search+filter: TopK selection; Iteration depth: variable; Time complexity: O(N ∗ k ∗ λi); Prompt generation model: RoBERTa, GPT-3; Target models: RoBERTa, GPT-3.
10. Boosted Prompting (Pitis et al., 2023): Seed instructions: instruction induction; Inference evaluation: task accuracy; Candidate generation: ensemble-based method; Search+filter: TopK selection; Iteration depth: variable; Time complexity: O(N ∗ k ∗ λi); Prompt generation model: text-curie-001, text-curie-003, GPT-3.5, code-davinci-002; Target models: text-curie-001, text-curie-003, GPT-3.5, code-davinci-002.
11. BPO (Cheng et al., 2024): Seed instructions: manually created; Inference evaluation: LLMaaJ (pairwise); Candidate generation: finetuned LLMs; Search+filter: NA; Iteration depth: NA; Time complexity: O(λt + |Dval| ∗ λi); Prompt generation model: Llama2-7b-chat; Target models: vicuna-7b-v1.3, vicuna-13b-v1.3, llama-1-7b, llama-1-13b.
12. CLAPS (Zhou et al., 2023): Seed instructions: manually created; Inference evaluation: entropy-based score + task accuracy; Candidate generation: genetic algorithm (mutation + crossover); Search+filter: TopK selection; Iteration depth: variable; Time complexity: O(N ∗ k ∗ |V| ∗ λi); Prompt generation model: Flan-T5; Target models: Flan-T5 large and base.
13. Directional Stimulus (Li et al., 2023d): Seed instructions: manually created; Inference evaluation: BLEU, BERTScore; Candidate generation: RL-trained NN; Iteration depth: variable; Time complexity: O(λt); Prompt generation model: T5, GPT-2; Target models: ChatGPT, Codex, InstructGPT.
14. DLN (Sordoni et al., 2023): Seed instructions: manually created; Inference evaluation: task accuracy + NLL; Candidate generation: LLM-mutator; Search+filter: TopK selection; Iteration depth: fixed; Time complexity: O(N ∗ k ∗ |Dtrain|); Prompt generation model: GPT-3 (text-davinci-003), GPT-4; Target models: GPT-3 (text-davinci-003), GPT-4.
15. DSP (Khattab et al., 2022): Seed instructions: instruction induction; Inference evaluation: task accuracy; Candidate generation: program synthesis; Search+filter: TopK selection; Iteration depth: fixed; Time complexity: O(N ∗ k ∗ λi); Prompt generation model: GPT-3.5; Target models: LM: GPT-3.5, retrieval: ColBERTv2.
16. DSPy (Khattab et al., 2024): Seed instructions: manually created + instruction induction; Inference evaluation: task accuracy + LLM-feedback; Candidate generation: program synthesis; Search+filter: TopK selection; Iteration depth: variable; Time complexity: O(N ∗ k ∗ B ∗ λi).
17. GATE (Joko et al., 2024): Seed instructions: manually created; Inference evaluation: human feedback; Candidate generation: LLM-rewriter; Iteration depth: open-ended; Time complexity: O(N ∗ (λm + |Dval| ∗ λi)); Prompt generation model: GPT-4; Target models: GPT-4.
18. GPO (Li et al., 2023c): Seed instructions: instruction induction; Inference evaluation: task accuracy and F1; Candidate generation: metaprompt design; Search+filter: TopK selection; Time complexity: O(N ∗ C ∗ |V| ∗ B ∗ E); Prompt generation model: gpt-3.5-turbo-0301; Target models: gpt-3.5-turbo-0301.
19. PACE (Dong et al., 2024b): Seed instructions: manually created; Inference evaluation: NLL + task accuracy + BLEU and BERTScore; Candidate generation: LLM-rewriter; Search+filter: TopK selection; Iteration depth: <3; Time complexity: O(N ∗ |ρ| ∗ |Dval|); Prompt generation model: gpt-3.5-turbo (0301); Target models: text-davinci-002, text-davinci-003, gpt-3.5-turbo, GPT-4.
20. PREFER (Zhang et al., 2024a): Seed instructions: manually created; Inference evaluation: task accuracy; Candidate generation: LLM-rewriter + ensemble method; Search+filter: TopK selection; Iteration depth: fixed; Time complexity: O(N ∗ |ρ| ∗ |Dval|); Prompt generation model: ChatGPT; Target models: ChatGPT.
21. PromptAgent (Wang et al., 2024a): Seed instructions: manually created; Inference evaluation: task accuracy + LLM-feedback; Candidate generation: LLM-rewriter; Search+filter: UCT-based bandit search; Iteration depth: fixed; Time complexity: O(N ∗ k ∗ λi); Prompt generation model: GPT-4; Target models: GPT-3.5, GPT-4, PaLM-2.
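Read row-wise, most entries instantiate the same loop: seed prompts are scored, new candidates are proposed, and a search strategy keeps the best ones for a fixed or variable number of iterations. The sketch below illustrates that shared skeleton with an LLM-rewriter-style mutator and TopK selection; it is not any single surveyed system, and `score` and `rewrite` are hypothetical stand-ins for method-specific components.

```python
import random

def optimize_prompt(seeds, score, rewrite, n_iters=5, top_k=4, n_children=2):
    """Generic APO loop: evaluate candidates, keep the TopK, mutate, repeat.

    score(prompt) -> float           # e.g. task accuracy on Dval
    rewrite(prompt, n) -> list[str]  # e.g. an LLM rewriter proposing n edits
    """
    beam = sorted(seeds, key=score, reverse=True)[:top_k]
    for _ in range(n_iters):                      # fixed iteration depth
        candidates = set(beam)
        for parent in beam:                       # candidate generation
            candidates.update(rewrite(parent, n_children))
        beam = sorted(candidates, key=score, reverse=True)[:top_k]  # TopK filter
    return beam[0]

# Toy usage with stand-ins (longer prompts score higher here):
suffixes = ["Think step by step.", "Be concise.", "Cite your evidence."]
print(optimize_prompt(
    ["Answer the question."],
    score=len,
    rewrite=lambda p, n: [p + " " + random.choice(suffixes) for _ in range(n)],
))
```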
Table 8: Automatic prompt optimization for LLM-as-a-Judge methods, Text-gradients (Pryzant et al., 2023; Wang et al., 2024a) and PE2 (Ye et al., 2024).

Text-gradients (Pryzant et al., 2023)
- LLMaaJ prompt: "I'm trying to write a zero-shot classifier prompt. My current prompt is: "{prompt}". But this prompt gets the following examples wrong: {error_string}. Give {num_feedbacks} reasons why the prompt could have gotten these examples wrong. Wrap each reason with <START> and <END>."
- Candidate prompt: "Determine whether the Statement is a lie (Yes) or not (No) based on the Context and other information." Failing example: "Statement: Small businesses (are) going out of business in record numbers. Job title: Senator. State: Texas. Party: republican. Context: a speech at Liberty University." Label: Yes. Prediction: No.
- Response: N/A
- Subject of evaluation: Prompt
- Evaluation output: "The prompt does not take into account the speaker's potential biases or agenda, which could influence the veracity of their statements."
- Rewritten prompt: "Determine if the statement is true (Yes) or false (No) based on the context, sources referenced, and potential biases of the speaker."

Text-gradients (Wang et al., 2024a)
- LLMaaJ prompt: "I'm writing prompts for a language model designed for a task. My current prompt is: {cur prompt}. But this prompt gets the following examples wrong: {error string}. For each wrong example, carefully examine each question and wrong answer step by step, provide comprehensive and different reasons why the prompt leads to the wrong answer. At last, based on all these reasons, summarize and list all the aspects that can improve the prompt."
- Candidate prompt: failing example: "Premise: William learns that kids play in water coming up in streams out of a tiled floor with image of a large rose on it. Hypothesis: William learns that kids are playing in water." Label: Non-entailment. Prediction: Entailment.
- Response: Non-entailment
- Subject of evaluation: Prompt
- Evaluation output: Error Feedback: "Ignoring context and detail". The model might be overlooking the details of the premise "kids play in water coming up in streams out of a tiled floor with an image of a large rose on it," which directly implies the hypothesis.
- Rewritten prompt: "Compare the provided sentences. Take into account the subtleties in the context, pinpoint the order of events and differentiate between facts and assumptions. If the hypothesis is a direct result of the premise, select 'entailment'."

PE2 (Ye et al., 2024)
- LLMaaJ prompt: "# Current Prompt: Let's think step by step. # Instruction: For each example, provide reasoning according to the following template: * Output is correct? * Necessary to edit the prompt? * If yes, suggestions on prompt editing? Now carefully review your reasoning and proceed with step 2: refine the prompt. # Current Prompt: Let's think step by step. # Instructions: * The total length should be less than 50 words. * Reply with the prompt. Do not include other text."
- Candidate prompt: "Let's think step by step." # Full Template: "Question: Answer: Let's think step by step." # Examples: Example 1: Input: "George had 28 socks. If he threw away 4 socks ..." Output: 64. Reasoning: "Step 1: George had 28 socks. Step 2: ..." Label: 60.
- Response: N/A
- Subject of evaluation: Both
- Evaluation output: "## Example 1: Output is correct? No. Reasoning: the model didn't subtract the socks he threw away. Prompt describing the task correctly? Yes. Necessary to edit the prompt? Yes. Suggestions: The prompt should be edited to guide the model to perform subtraction. [More examples ...]"
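The two text-gradient rows above follow the same two-call pattern: one judge call explains why the current prompt failed on a batch of errors, and a rewriter call applies that feedback. Below is a minimal sketch of one such update, loosely adapted from the templates in the table; the `llm(prompt) -> str` completion function and the rewrite wording are hypothetical.

```python
def text_gradient_step(llm, prompt, failures, num_feedbacks=2):
    """One text-gradient update: critique the prompt on its errors, then edit."""
    error_string = "\n".join(
        f"Input: {x}\nLabel: {y}\nPrediction: {pred}" for x, y, pred in failures
    )
    critique = llm(
        "I'm trying to write a zero-shot classifier prompt. "
        f'My current prompt is:\n"{prompt}"\n'
        f"But this prompt gets the following examples wrong:\n{error_string}\n"
        f"Give {num_feedbacks} reasons why the prompt could have gotten "
        "these examples wrong. Wrap each reason with <START> and <END>."
    )
    return llm(
        f'My current prompt is:\n"{prompt}"\n'
        f"Here is feedback on its failures:\n{critique}\n"
        "Rewrite the prompt to address the feedback. Reply with the prompt only."
    )
```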
Table 9: Automatic prompt optimization for LLM-as-a-Judge methods, Hints (Sun et al., 2023).
Hints (Sun et al., 2023)
- LLMaaJ prompt: "Given following task: [Task Description]. Given Input: [Input]. And its expected Output: [output]. List the reason or hint why it's with this expected output within tag <hint> and </hint>."
- Candidate prompt: "Determine whether one sentence entails the next. # Given Input: [input]. Identify the relation between the following premises and hypotheses, choosing from the options 'entailment' or 'non-entailment'. Put your answer within tag <Ans> and </Ans>. # Result"
- Response: Non-entailment
- Subject of evaluation: Prompt
- Evaluation output: "- Entailment occurs when the hypothesis is a logical consequence of the premise, or when the premise guarantees the truth of the hypothesis, regardless of the level of specificity or simplification of the terms involved. - Non-entailment occurs when the premise does not guarantee the truth of the hypothesis, or when there is a possibility that the hypothesis is false or unknown, especially when the premise involves beliefs or thoughts of other people."
- Rewritten prompt: "Determine whether one sentence entails the next. Some useful hints are: - Entailment occurs when the hypothesis is a logical consequence of the premise, or when the premise guarantees the truth of the hypothesis, regardless of the level of specificity or simplification of the terms involved. - Non-entailment occurs when the premise does not guarantee the truth of the hypothesis, or when there is a possibility that the hypothesis is false or unknown, especially when the premise involves beliefs or thoughts of other people. # Given Input: [input]. Identify the relation between the following premises and hypotheses, choosing from the options 'entailment' or 'non-entailment'. Put your answer within tag <Ans> and </Ans>. # Result"
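Operationally, the hint pattern amounts to asking the judge to explain each (input, expected output) pair, collecting the <hint> spans, and splicing them into the task prompt, as in the rewritten prompt above. A minimal sketch, again assuming a hypothetical `llm` completion function:

```python
import re

def add_hints(llm, task_prompt, examples):
    """AutoHint-style enrichment: derive hints from labeled examples and
    prepend them to the candidate prompt."""
    hints = []
    for x, y in examples:
        out = llm(
            f"Given following task: {task_prompt}\n"
            f"Given Input: {x}\nAnd its expected Output: {y}\n"
            "List the reason or hint why it's with this expected output "
            "within tag <hint> and </hint>."
        )
        hints.extend(re.findall(r"<hint>(.*?)</hint>", out, flags=re.DOTALL))
    hint_block = "\n".join(f"- {h.strip()}" for h in hints)
    return f"{task_prompt} Some useful hints are:\n{hint_block}"
```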
Table 10: Automatic prompt optimization for LLM-as-a-Judge methods, Critique (He et al., 2025).
Critique (He et al., 2025)
- LLMaaJ prompt: "Critique: - Number of words: The predicted summaries tended to be longer with more details while the reference summaries were shorter and more concise. - Number of sentences: The predicted summaries used more sentences to describe the inputs while the reference summaries were more succinct with fewer sentences. - Precision: Some details in the predicted summaries were not important and not mentioned in the reference summaries. - Recall: Some key details highlighted in the reference summaries were missing from the predicted summaries. Suggestion: - Specifying the expected length of the summary (e.g., 1-2 sentences). - Emphasizing to only include the most important/key details. - Indicating which details should be prioritized or omitted."
- Candidate prompt: "For the given text, write a 1-2 sentence summary within 〈summary〉 tags that highlights the most important details. Focus on including who the key people are and what happened between them. INSERT INPUT HERE. Some key details to focus on include the main characters, any plans or arrangements that were made, and the overall outcome or resolution."
- Response: "Tegan tells Valentia that Paul's brother sent her a friend request on social media, though she and Paul had previously broken up. Valentia advises Tegan to ignore the request, not wanting Tegan to revisit her past relationship with Paul." Score: 42.1.
- Subject of evaluation: Both
- Evaluation output: "Comparing the high-score and low-score instructions, here are some suggestions that could improve them: 〈suggestion〉 Specify the desired length or range for the summaries (e.g., 10 words and 1-2 sentences). 〈/suggestion〉 〈suggestion〉 Specify to focus on key events and specify which details 〈/suggestion〉 〈suggestion〉 Specify the output should not contain unnecessary context 〈/suggestion〉"
- Rewritten prompt: "Read the dialogue provided in INSERT INPUT HERE and identify the key events between characters and outcomes. Then write a 1-2 sentence summary within 〈summary〉 tags that concisely captures these important plot points, such as who will borrow a dress or who has an interview, while keeping within 10 words where possible. Focus only on the characters and salient events, omitting unnecessary context."
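The critique row separates diagnosis from editing: the judge contrasts predicted and reference outputs along explicit axes (length, precision, recall) and emits 〈suggestion〉 spans, which a rewriter then folds into the next candidate. A compact sketch of that two-stage call; the prompt wordings here are our own paraphrases and `llm` is again a hypothetical completion function.

```python
def critique_and_rewrite(llm, prompt, predicted, reference):
    """Critique-style update: diagnose output differences, then edit the prompt."""
    critique = llm(
        "Compare the predicted and reference outputs and critique the "
        "differences along length, precision, and recall:\n"
        f"Predicted: {predicted}\nReference: {reference}"
    )
    suggestions = llm(
        f"Here is a critique of a prompt's outputs:\n{critique}\n"
        "List suggestions that could improve the prompt, each wrapped in "
        "<suggestion> and </suggestion> tags."
    )
    return llm(
        f"Current prompt: {prompt}\nSuggestions:\n{suggestions}\n"
        "Rewrite the prompt to incorporate the suggestions. "
        "Reply with the prompt only."
    )
```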
Table 11: Automatic prompt optimization for LLM-as-a-Judge methods, Reflection (Cieri et al., 2022).
Reflection (Cieri et al., 2022)
- LLMaaJ prompt: "Here is a conversation with an LLM: {x|y}. Explain how to improve {x}." For the update step: "Below are the criticisms on {x}: Below are the criticisms on {y}: Incorporate the criticisms, and produce a new variable."
- Response: N/A
- Subject of evaluation: Both
- Evaluation output: Example output for instance optimization (a specific coding problem, for example): "Handling 'nums[i] == k': The current logic does not correctly handle the case when 'nums[i] == k'. The balance should be reset or adjusted differently when 'k' is encountered." Output for prompt optimization: the evaluator LLM simply returns whether the generated output and the ground truth match (a math problem result, for example).
- Rewritten prompt: For prompt optimization, from: "You will answer a reasoning question. Think step by step. The last line of your response should be of the following format: 'Answer: $VALUE' where VALUE is a numerical value." to: "You will answer a reasoning question. List each item and its quantity in a clear and consistent format, such as '- Item: Quantity'. Sum the values directly from the list and provide a concise summation. Ensure the final answer is clearly indicated in the format: 'Answer: $VALUE' where VALUE is a numerical value. Verify the relevance of each item to the context of the query and handle potential errors or ambiguities in the input. Double-check the final count to ensure accuracy."
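The reflection template treats the prompt (or any intermediate variable) as something to be updated from criticisms, which makes the loop compact to state. A minimal sketch of the pattern shown in the row above, with the meta-prompts lightly paraphrased and a hypothetical `llm` function:

```python
def reflect_and_update(llm, variable, response):
    """Reflection-style update: criticize a variable given the LLM's response,
    then incorporate the criticisms into a new version of the variable."""
    criticisms = llm(
        f"Here is a conversation with an LLM:\n{variable} | {response}\n"
        "Explain how to improve the prompt above."
    )
    return llm(
        f"Below are the criticisms on the prompt:\n{criticisms}\n"
        "Incorporate the criticisms, and produce a new variable."
    )
```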
Table 12: Automatic prompt optimization for LLM-as-a-Judge methods, Safety Score (Sinha et al., 2024).
Safety Score (Sinha et al., 2024)
- LLMaaJ prompt: "You are a security gatekeeper. Given an existing prompt and a series of cases where it made mistakes, look through each case carefully and identify what is causing the mistakes. Based on these observations, output a way to improve the prompts based on the mistakes. From a security perspective, a mistake was made when the LLM helped the user perform a malicious task. If the LLM responds with a "canned response", like, "I am just an AI, I cannot help you with this...", then the LLM did not help the user. If the LLM says it cannot answer without more information then the LLM did not help. Some ways to make a prompt secure are: 1. Add a detailed description of the task and how to solve it. Consider providing a step-by-step description of the process the LLM should follow. 2. Specify instructions to ensure the LLM's answer is respectful. 3. Specify guardrails around input and output format. *{passed in data}* ways to make the existing prompt more secure based on observations of the mistakes in the cases above are:"
- Candidate prompt: "You are a mutator. Given a prompt, your task is to generate another prompt with the same semantic meaning and intentions. # Example: current prompt: Your mission is to ascertain the sentiment of the provided text and assign a sentiment label from ['negative', 'positive']. mutated prompt: Determine the sentiment of the given sentence and assign a label from ['negative', 'positive']. # Given: {passed in data}"
- Subject of evaluation: Response
- Evaluation output: KPI score; Security score. A prompt is selected as locally optimal with respect to an objective if it achieves the best performance on this objective among all prompts that exhibit similar performance across all other objectives.
- Rewritten prompt: "Task Description: You are a Sentiment Classifier. Your task is to classify the emotional sentiment of the text to understand the overall tone and sentiment expressed. This sentiment classification can be used to analyze customer feedback, improve product experiences, or identify potential issues. Instructions for Respectful Language Model: It is important to ensure that the language model promotes positive and respectful interactions. Therefore, please note that the language model should not provide any support or guidance for illegal activities, harmful actions, or unethical behavior. Emphasize the importance of promoting positive and respectful interactions. Input and Output Format: The input should only consist of a single sentence or a short paragraph that requires sentiment classification. The output should provide the emotional sentiment without any additional information or suggestions. Examples: Please classify the emotional sentiment expressed in the following texts: Input: "The new software update is user-friendly and efficient." Correct Answer: ['positive'] {Another four few-shot examples} Please provide the emotional sentiment for each input text without any additional information or suggestions."
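The selection rule quoted in the evaluation output (keep a prompt if it is best on one objective among prompts that perform similarly on all others) is, in effect, a tolerance-based local-optimality filter over the KPI and security scores. A small illustrative implementation of that rule; the function name, tolerance, and demo values are hypothetical.

```python
def locally_optimal(prompts, scores, tol=0.02):
    """Keep prompts that achieve the best score on some objective among
    prompts whose scores on all other objectives are within `tol`.

    scores: dict prompt -> dict objective -> float, e.g.
        {"p1": {"kpi": 0.81, "security": 0.90}, ...}
    """
    objectives = list(next(iter(scores.values())))
    keep = []
    for p in prompts:
        for obj in objectives:
            # Peers: prompts with similar performance on every other objective.
            peers = [
                q for q in prompts
                if all(abs(scores[q][o] - scores[p][o]) <= tol
                       for o in objectives if o != obj)
            ]
            if scores[p][obj] >= max(scores[q][obj] for q in peers):
                keep.append(p)
                break
    return keep

# Toy usage: both prompts survive; each is best within its own peer group.
demo = {"p1": {"kpi": 0.81, "security": 0.90},
        "p2": {"kpi": 0.80, "security": 0.97}}
print(locally_optimal(list(demo), demo))  # ['p1', 'p2']
```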