Does Prompt Formatting Have Any Impact on LLM Performance?

arXiv:2411.10541v1 [cs.CL] 15 Nov 2024

Abstract

In the realm of Large Language Models
series for two main reasons: the lack of comparative analyses of behavioral patterns across different GPT model iterations, especially the latest GPT-4-turbo, and the need to identify effective interaction methods and optimal input formats for these models, which do not disclose their training methodologies or data.

Our study is designed to investigate the following key questions:

• Sensitivity: To what extent does the performance of GPT models vary with different prompt formats?

• Consistency: Are GPT models capable of producing uniform responses to identical queries when presented with varying prompt structures?

• Transferability: Is there an optimal prompt format that is universally effective across diverse GPT models, thereby ensuring peak performance?

In addition to our primary questions, we explore the correlation between prompt format efficacy and task-specific competencies, as well as the impact of model size on performance. OpenAI’s GPT models including GPT-35-turbo and GPT-4 (Achiam et al., 2023) show unpredictable sensitivity to prompt format changes, with significant performance discrepancies across all models and benchmarks. Notably, there is no universally optimal format, even within the same generational lineage. However, GPT-4-turbo demonstrates greater resilience to prompt format changes compared to its predecessors and contemporaries. In summary, our key contributions are as follows:

• This study is the first to compare the impact of different prompt formats on GPT models’ performance across various tasks, examining plain text, Markdown, YAML, and JSON.

• Our research provides an extensive analysis of prompt formatting effects on GPT models across a wide range of tasks, including multiple-choice questions, code generation, and translation.

2 Experimental Setup

2.1 Datasets

Our experiments span various tasks and datasets, categorized into three main groups:

• Natural Language to Natural Language (NL2NL): Includes Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020) and NER Finance from OpenAI Evals (OpenAI, 2023).

• Natural Language to Code (NL2Code): Includes HumanEval (Chen et al., 2021) and FIND (Schwettmann et al., 2023).

• Code to Code (Code2Code): Includes CODEXGLUE (Lu et al., 2021) and HumanEval-X (Zheng et al., 2023).

We initially assess model performance using task-specific scalar scoring functions, followed by metrics from Sections 3 to 5 to address our research questions. Detailed dataset descriptions and metrics are in Appendix B.

2.2 Prompt Design

We use various input formats: plain text, Markdown, YAML, and JSON. Prompts include five components: persona, task instructions, examples, output format instructions, and user ask. We ensure the content of each placeholder stays the same across different prompt formats. The only differences are in structure and syntax. To avoid confounding variables, we design the prompts so that the context and meaning remain consistent, regardless of the format. Examples are in Appendix C.
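To make this setup concrete, the sketch below renders the same five prompt components into the four formats we study. It is a simplified illustration rather than the exact templates of Table 2/Appendix C; the component contents and the render_prompt helper are our own, and the YAML rendering assumes the PyYAML package.

```python
import json
import yaml  # PyYAML; used only for the YAML rendering

# The five components held constant across formats (illustrative content).
components = {
    "persona": "You are an annotator extracting named entities from financial documents.",
    "instructions": "List the named entities in the order they appear in the sentence.",
    "examples": "{ICL EXAMPLE INPUT} -> {ICL EXAMPLE SOLUTION}",
    "output_format": "State your final answer as a comma-separated list in square brackets.",
    "user_ask": "{INPUT}",
}

def render_prompt(fmt: str, c: dict) -> str:
    """Render identical content as plain text, Markdown, YAML, or JSON."""
    if fmt == "plaintext":
        return " ".join(c.values())
    if fmt == "markdown":
        return "\n".join(f"## {k.replace('_', ' ').title()}\n{v}" for k, v in c.items())
    if fmt == "yaml":
        return yaml.safe_dump(c, sort_keys=False)
    if fmt == "json":
        return json.dumps(c, indent=2)
    raise ValueError(f"unknown format: {fmt}")

for fmt in ["plaintext", "markdown", "yaml", "json"]:
    print(f"=== {fmt} ===\n{render_prompt(fmt, components)}\n")
```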
2.3 Models

Experiments were conducted on OpenAI’s GPT-3.5 and GPT-4 models via Azure (Microsoft, 2024). For GPT-3.5, we used “gpt-35-turbo-0613” and “gpt-35-turbo-16k-0613” to compare context window sizes (4k vs. 16k). For GPT-4, we used “gpt-4-32k-0613” and “gpt-4-1106-preview” to test the newer, faster variant with a 128k context window.
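For reference, a single query under this setup might look like the sketch below, which uses the AzureOpenAI client from the openai Python SDK. The endpoint, API version, deployment names, and temperature shown here are placeholder assumptions, not the paper's exact configuration.

```python
import os
from openai import AzureOpenAI  # openai>=1.0

# Placeholder endpoint and API version; the deployment name must match your Azure resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def query_model(deployment: str, system_prompt: str, user_ask: str) -> str:
    """Send one formatted prompt to a deployed GPT model and return its reply."""
    response = client.chat.completions.create(
        model=deployment,  # e.g. a deployment of gpt-35-turbo-0613 or gpt-4-1106-preview
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_ask},
        ],
        temperature=0,  # assumption: deterministic decoding, as used for the MMLU consistency runs
    )
    return response.choices[0].message.content
```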
3 Sensitivity
MMLU        GPT-35-turbo-0613    GPT-35-turbo-16k-0613    GPT-4-1106-preview    GPT-4-32k-0613
Max         59.7 (JSON)          59.4 (JSON)              81.2 (Markdown)       81.3 (Markdown)
Min         50.0 (Markdown)      50.7 (Markdown)          73.9 (JSON)           77.8 (JSON)
p-value     < 0.001              < 0.001                  < 0.001               < 0.001
Table 1: Sensitivity of model performance to prompt format assessed using one-sided matched pair t-tests. The table displays metrics for the top and bottom formats (Max/Min) and p-values for each dataset/model. All p-values are below 0.05, except for GPT-4-1106-preview on HumanEval, confirming widespread prompt format sensitivity.
We begin by analyzing whether model performance is sensitive to any changes in the prompt format at all. To assess this, we conducted a one-sided matched pair t-test, comparing the best and worst performing formats for each model across various benchmarks. The resulting p-values, shown in Table 1, are mostly below 0.01. This suggests that the differences in model performance due to format changes are statistically significant.

Figure 4 visualizes how the models fare across all benchmarks, highlighting a considerable range in performance. For instance, in the FIND dataset, both GPT-35-turbo-0613 and GPT-35-turbo-16k-0613 show a dramatic 200% improvement when prompts are switched from Markdown to plain text. Similarly, for the HumanEval benchmark, the GPT-4 model with a 32k-0613 configuration exhibits an impressive performance boost of over 300% when the prompt format is changed from JSON to plain text. This suggests that LLM performance may not be robust to the choice of prompt format.
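The sketch below shows how such a one-sided matched pair t-test can be run with SciPy; the per-example score arrays and the significance threshold are illustrative assumptions, not the paper's data.

```python
import numpy as np
from scipy import stats

# Per-example scores for the best- and worst-performing formats on one benchmark
# (illustrative values; in practice these come from the scoring functions in Appendix B).
best_format_scores = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0], dtype=float)   # e.g. JSON
worst_format_scores = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=float)  # e.g. Markdown

# One-sided matched pair t-test: H1 is that the best format scores higher on average.
t_stat, p_value = stats.ttest_rel(best_format_scores, worst_format_scores,
                                  alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Performance difference between formats is statistically significant.")
```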
4 Consistency

Our consistency score is computed over the test set, where N is the test set size and A represents the model’s answer. A higher score indicates greater answer consistency between prompts.
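As one concrete reading of such a metric, the sketch below scores agreement between the answers a model gives under two prompt formats; the exact-match comparison and the function name are our assumptions for illustration, not the paper's exact definition.

```python
def consistency_score(answers_format_a: list[str], answers_format_b: list[str]) -> float:
    """Fraction of test examples where two prompt formats yield the same answer.

    A higher score means the model's answers agree more often across formats.
    """
    assert len(answers_format_a) == len(answers_format_b)
    n = len(answers_format_a)  # N: test set size
    agreements = sum(a.strip() == b.strip()
                     for a, b in zip(answers_format_a, answers_format_b))
    return agreements / n

# Example: answers to the same five MMLU questions under Markdown vs. JSON prompts.
markdown_answers = ["A", "C", "B", "D", "A"]
json_answers     = ["A", "B", "B", "D", "C"]
print(consistency_score(markdown_answers, json_answers))  # 0.6
```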
4.2 Are larger models more consistent in generated outputs between templates?

Our study assessed the consistency of model outputs using the MMLU and FIND datasets, as shown in Figures 2 and 8. For MMLU, we set the temperature to zero to eliminate response variability. The GPT-3.5-turbo series displayed low consistency, with scores below 0.5 and only 16% identical responses between Markdown and JSON formats. In contrast, GPT-4’s consistency scores surpassed 0.5, indicating better reliability across different prompts. For the FIND dataset, following the settings from (Schwettmann et al., 2023), GPT-4 again outperformed the GPT-3.5-turbo series in consistency. These findings suggest that larger models like GPT-4 are more consistent, but there is still a need for model improvements to achieve reliable outputs across prompt formats.
(Figure panels: (a) GPT-35-Turbo-0613, (b) GPT-35-turbo-16k-0613)
7 Limitations

This study was focused on GPT-based models; however, we plan to examine the impact of prompt formats on other models, such as LLaMA (Touvron et al., 2023), Gemini (Team et al., 2023), PaLM (Chowdhery et al., 2022), or smaller models like Phi (Li et al., 2023) in the future. This would provide a more holistic understanding of the influence that prompt formatting exerts across different LLM families.

Moreover, there is an opportunity to enhance the breadth of template exploration in subsequent studies. Our research did not include formats like HTML or XML, which are prevalent in the training datasets of many models. Incorporating these formats could yield a more exhaustive examination of prompt format effects.

Lastly, our experimental design maintained all other prompt design elements constant, isolating prompt format as the sole variable. It would be intriguing for future work to investigate how the sensitivity of models to prompt format might shift when other prompt engineering techniques are modified. This includes varying the number of few-shot examples provided or refining the precision of prompt instructions. Such research could offer valuable insights into the interplay between prompt structure and model responsiveness, potentially informing more effective prompt engineering practices.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Armen Aghajanyan. June 2023. Tweet: "Susan and I found MMLU performance jump 6-10 points in the 40s by formatting multiple choice as (A) not A in MMLU (for internal model). All evaluation of LLMs are broken. Evaluating a task requires marginalizing across all prompts that describe the task, not point estimate of one."

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Preprint, arXiv:2005.14165.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.

Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, et al. 2024. Stealing part of a production language model. arXiv preprint arXiv:2403.06634.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways. Preprint, arXiv:2204.02311.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-augmented generation for knowledge-intensive NLP tasks. Preprint, arXiv:2005.11401.

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. Preprint, arXiv:2307.03172.

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. Preprint, arXiv:2104.08786.

Microsoft. Guidance. https://github.com/guidance-ai.

Microsoft. 2024. Azure OpenAI service models. https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-4-and-gpt-4-turbo-preview. Accessed: 2024-03-26.

OpenAI. 2023. Evals. https://github.com/openai/evals.

OpenAI. 2024. New embedding models and API updates. Accessed: 2024-03-26.

OpenAI. November 2023. Improved instruction following and JSON mode.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Preprint, arXiv:2203.02155.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A systematic survey of prompt engineering in large language models: Techniques and applications. Preprint, arXiv:2402.07927.

Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, and Antonio Torralba. 2023. FIND: A function description benchmark for evaluating interpretability methods. Preprint, arXiv:2309.03886.

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2023. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.

Bangzhao Shu, Lechen Zhang, Minje Choi, Lavinia Dunagan, Dallas Card, and David Jurgens. 2023. You don’t need a personality test to know these models are unreliable: Assessing the reliability of large language models on psychometric instruments. arXiv preprint arXiv:2311.09718.

Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2024. Table meets LLM: Can large language models understand structured table data? A benchmark and empirical study. In The 17th ACM International Conference on Web Search and Data Mining (WSDM ’24).

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Anton Voronov, Lena Wolf, and Max Ryabinin. 2024. Mind your format: Towards consistent evaluation of in-context learning improvements. arXiv preprint arXiv:2401.06766.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. Preprint, arXiv:2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. Preprint, arXiv:2210.03629.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. Preprint, arXiv:2210.03493.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. Preprint, arXiv:2102.09690.

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. Preprint, arXiv:2303.17568.

A Related Work

Prompt Engineering The field of prompt engineering has garnered significant interest in recent years, in part due to the emergent capabilities of the most capable LLMs and in part due to efforts to better control their still unpredictable outcomes. A prominent strand of research within this domain concentrates on innovative prompting methodologies. These include few-shot prompting (Brown et al., 2020), which enables models to adapt to new tasks without extensive retraining, and Chain-of-Thought prompting (Wei et al., 2023), both of which are designed to enhance the reasoning capabilities of LLMs. Additionally, Automatic Chain-of-Thought (Auto-CoT) (Zhang et al., 2022) and Self-Consistency (Wang et al., 2023) approaches have been developed to further refine these reasoning processes. To mitigate hallucinations in LLM outputs, techniques such as Retrieval Augmented Generation (RAG) (Lewis et al., 2021) and ReAct (Yao et al., 2023) have been introduced. A thorough examination of these methodologies can be found in the survey by Sahoo et al. (2024).

In recent developments, a novel prompt programming framework (Microsoft) has been introduced, which offers greater control and efficiency in generating structured outputs. Our study diverges from this approach by examining the effects of more prevalent and established prompt formats on LLMs, as opposed to investigating formats that are newly proposed and not yet widely adopted. Furthermore, it is important to note that third-party tools are predominantly designed for integration with open-source models, which may not seamlessly extend to proprietary models such as GPT. Another similar vein of research is dedicated to the structural design of prompts, aiming to optimize task performance without altering the inherent semantic content. This includes investigations into the sequential arrangement of context (Liu et al., 2023; Zhao et al., 2021; Lu et al., 2022) and the design of prompt formats (Sclar et al., 2023; Voronov et al., 2024; Shu et al., 2023). Our work contributes to this growing body of literature by examining the impact of prompt formatting on the performance of LLMs.

Prompt Format The sensitivity of LLMs to prompt construction is a well-documented phenomenon, yet research on the impact of prompt formats on model performance remains sparse. Pioneering studies (Sclar et al., 2023; Voronov et al., 2024; Shu et al., 2023) have conducted rigorous investigations, revealing that widely used open-source LLMs exhibit extreme sensitivity to variations in prompt format. These studies, however, primarily focus on subtle, local changes to the format, such as the number of colons following a question, the insertion of newlines, or the selection of input/output verbalizers. Moreover, their experimental designs are confined to classification tasks, limiting the generalizability of findings across diverse tasks.

Our research diverges from these existing studies by examining the effects of global prompt format modifications on model performance, offering insights that are applicable to a broad spectrum of LLM-based tasks that necessitate prompt engineering. The closest related work to ours is by Sui et al. (2024), which however only provides a cursory exploration of format influence and is restricted to tabular data. To the best of our knowledge, our study is the first effort to systematically investigate the impact of global prompt format variations, an inescapable aspect of prompt engineering design decisions.

B Datasets

We evaluate six distinct benchmarks and classify them according to the nature of the task involved.

NL2NL
• Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020) covers 57 subjects, including 20 STEM subjects, 13 humanities subjects, 12 social sciences subjects, and 12 other subjects. Each subject contains at least 100 multiple-choice questions, testing both world knowledge and problem-solving ability. We use the dev set, which contains 5 questions per subject, as few-shot examples, and the test set, containing 14,079 questions of varying difficulty, to evaluate model performance. We use accuracy to measure model performance.

• NER Finance: OpenAI Evals (OpenAI, 2023) is a framework containing a registry of evaluations for testing LLMs; NER Finance is one of those evaluations. This dataset contains samples, between one sentence and one paragraph long, taken from financial documents. The task is to extract all of the entities in the document, and the evaluation checks whether the LLM outputs each entity, in order. We randomly sample 500 examples from this dataset.

NL2Code

• HumanEval (Chen et al., 2021) is a benchmark dataset consisting of a collection of Python programming problems, each accompanied by a function signature, a docstring outlining the problem to be solved, and a set of unit tests that a correct implementation must pass. We use the evaluation metric pass@1, which checks whether the generated code passes the given unit tests in one attempt (a minimal pass@1 sketch follows this list). We use all 164 samples in this dataset.

• FIND (Schwettmann et al., 2023): The Function Interpretation and Description (FIND) benchmark is a natural-language-to-code generation task. The LLM is given 5 example inputs and outputs to an unknown Python function and is tasked with reverse engineering the original Python code. We evaluate the benchmark by comparing the outputs of test cases on a ground-truth function with the outputs from LLM-generated functions. We use the "strings" category of functions for the task, consisting of 500 functions. We provide the LLM with 5 pairs of input and output for each function. To select these examples, we randomly sample from a dataset provided by Schwettmann et al. (2023) that contains example test strings for each function. To evaluate the generated function code, we use the string indicator metric introduced by Schwettmann et al. (2023), which measures the number of test cases passed by the function.
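As referenced in the HumanEval entry above, the sketch below shows one way pass@1 can be computed with a single generation per problem. It assumes the HumanEval convention of a test program that defines a check(candidate) function plus an entry_point naming the solution function; sandboxing and timeouts are omitted for brevity.

```python
def passes_unit_tests(candidate_code: str, test_code: str, entry_point: str) -> bool:
    """Execute a generated solution and its unit tests in a fresh namespace."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)              # define the candidate function
        exec(test_code, namespace)                   # define check(candidate)
        namespace["check"](namespace[entry_point])   # raises AssertionError on failure
        return True
    except Exception:
        return False

def pass_at_1(outcomes: list[bool]) -> float:
    """With one generation per problem, pass@1 is simply the fraction solved."""
    return sum(outcomes) / len(outcomes)

# Example: outcomes for three problems, each judged by passes_unit_tests(...).
print(pass_at_1([True, False, True]))  # ~0.667
```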
Code2Code

• CODEXGLUE (Lu et al., 2021) stands for General Language Understanding Evaluation benchmark for CODE. It was originally introduced to address the lack of diversified benchmarks in code intelligence by providing a diverse set of tasks, including code translation. We use the parallel code for Java to C# and vice versa, with a test set containing 1,000 parallel code samples in Java and C#, to test the capabilities of the LLMs in translating code from one programming language to another. The performance of the LLMs is evaluated using the BLEU (Papineni et al., 2002) score, which compares the generated code to the reference code.

• HumanEval-X (Zheng et al., 2023) is a benchmark designed to evaluate the multilingual capabilities of code-generative models. It contains 820 high-quality, human-crafted data samples, each accompanied by test cases. The dataset supports a variety of programming languages, including Python, C++, Java, JavaScript, and Go. We experiment with one dimension of code translation, focusing on Java to Python. To accomplish this task, we combine the "declaration" and "canonical-solution" fields to obtain the overall function in the respective language: "declaration" contains the function declaration for the respective language and "canonical-solution" holds the human-crafted example solution. Similar to CODEXGLUE, we use the BLEU (Papineni et al., 2002) score to measure performance; a BLEU scoring sketch follows this list.
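As referenced above, a minimal sketch of BLEU-based scoring for the code-translation tasks is shown below. It uses the sacrebleu package as one possible implementation; the paper does not specify its exact BLEU configuration, so tokenization and smoothing settings here are assumptions.

```python
import sacrebleu  # pip install sacrebleu

def bleu_for_translations(generated: list[str], references: list[str]) -> float:
    """Corpus-level BLEU between model-generated code and reference code."""
    # sacrebleu expects a list of reference streams, hence the extra list nesting.
    return sacrebleu.corpus_bleu(generated, [references]).score

# Toy example: two Java-to-C# translations compared against references.
generated = [
    "public int Add(int a, int b) { return a + b; }",
    "public bool IsEmpty(string s) { return s.Length == 0; }",
]
references = [
    "public int Add(int a, int b) { return a + b; }",
    "public bool IsEmpty(string s) { return string.IsNullOrEmpty(s); }",
]
print(f"BLEU: {bleu_for_translations(generated, references):.2f}")
```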
C Prompt Templates

In this section we provide examples of the four prompt templates we used for the NER Finance task. Prompts for all other tasks followed identical formatting. Variables that are injected into the prompt are denoted by blue text wrapped in braces. For example, a user ask being injected is denoted as {USER ASK}.
Plain text template:
{persona} {instructions} {examples} {output format instructions} {user ask}

Markdown template:
## Persona
{persona}
## Instructions
{instructions}
## Examples
{examples}
## Output Format
{output format instructions}
## User Question
{user ask}

YAML template:
Persona:
- {persona}
Instructions:
- {instructions}
Examples:
- {examples}
Output format:
- {output format instructions}
User question:
- {user ask}

JSON template:
{
  "Persona": "{persona}",
  "Instructions": "{instructions}",
  "Examples": "{examples}",
  "Output format": "{output format instructions}"
}
{
  "User ask": "{user ask}"
}

Table 2: Prompt templates considered in this paper. Placeholders are denoted with {variable name} and get replaced with task-specific context.
Plaintext
System:
You are a annotator working for large financial data company and are tasked with extracting named entities from
financial documents who follows strict guidelines for quality and formatting. The following sentence is from a financial
document. List the named entities in the order they appear in the sentence. If an entity appears multiple times, list
it multiples times. Entities should be stated in the format NAME - TYPE where TYPE can be PERSON, ORGANIZATION, or
LOCATION. State your final answer as a comma-separated list of entities enclosed in square brackets. Example: [Bank -
ORGANIZATION, Borrower - PERSON]. If there are no entities found, state your final answer as ’No entities found’. Provide
your chain of thought first and then respond with your final answer. Here is an example: {ICL EXAMPLE INPUT} {ICL EXAMPLE SOLUTION}
User:
{INPUT}
D Additional Research Questions

D.1 How does the format in which information is structured and presented influence the ability to solve problems that require different skill sets?

We analyze whether a model’s sensitivity to prompt format changes is related to the skills required to solve the task using the MMLU benchmark, which comprises 57 subjects categorized into four domains: humanities, social science, STEM, and others. Each domain encompasses various disciplines, necessitating distinct skill sets and knowledge for accurate question answering.

Figure 5 breaks down the performance on the MMLU dataset by domain. We observe that the performance spread exists across different tasks, and it is neither amplified nor eliminated by the specific skills required. This suggests that the model’s sensitivity to prompt formatting is a general characteristic, rather than being contingent on the specific skills or reasoning abilities required by different tasks.
Markdown
System:
## Persona
- You are a annotator working for large financial data company are tasked with extracting named entities from financial
documents who follows strict guidelines for quality and formatting.
## Instructions
- You will be given a sentence from a financial document. - List the named entities in the order they appear in the sentence.
- If an entity appears multiple times, list it multiples times.
- Provide your chain of thought first and then respond with your final answer.
## Output Format
- Entities should be stated in the format NAME - TYPE where TYPE can be PERSON, ORGANIZATION, or LOCATION.
- State your final answer as a comma-separated list of entities enclosed in square brackets. Example: [Bank - ORGANIZATION,
Borrower - PERSON].
- If there are no entities found, state your final answer as ’No entities found’.
## Example
### DOCUMENT
{ICL EXAMPLE INPUT}
### Solution
{ICL EXAMPLE SOLUTION}
User:
### DOCUMENT
{INPUT}
YAML
System:
Persona:
Description: You are a annotator working for large financial data company are tasked with extracting named entities from
financial documents who follows strict guidelines for quality and formatting.
Instructions:
- You will be given a sentence from a financial document.
- List the named entities in the order they appear in the sentence.
- If an entity appears multiple times, list it multiples times.
- Provide your chain of thought first and then respond with your final answer.
Output_Format:
Entities should be stated in the format NAME - TYPE where TYPE can be PERSON, ORGANIZATION, or LOCATION. State your final
answer as a comma-separated list of entities enclosed in square brackets.
Examples:
- Document: {ICL EXAMPLE INPUT}
- Solution: {ICL EXAMPLE SOLUTION}
User:
Task:
- Document: {INPUT}
Performance is influenced by how information is presented to the model: regardless of the complexity or nature of the task, the way in which a problem is framed and communicated to the model can significantly impact its ability to process and respond to the information. Model performance is consistently influenced by prompt formatting across various tasks, regardless of the specific skills or knowledge required.

We next examine whether the degree of dispersion across prompt formats can be attributed to the size of the model. We compute the Coefficient of Mean Deviation (CMD) across all the prompt templates for every model:

CMD = \frac{\sum_{i} |s(p_i) - \bar{s}|}{n \cdot \bar{s}}

where s(p_i) is the scalar score obtained with prompt template p_i, \bar{s} is the mean score across templates, and n is the number of templates.
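A direct implementation of this statistic is sketched below; the numeric scores are illustrative placeholders only.

```python
import numpy as np

def coefficient_of_mean_deviation(template_scores: list[float]) -> float:
    """CMD = sum_i |s(p_i) - mean(s)| / (n * mean(s)).

    template_scores holds one scalar score per prompt template (plain text,
    Markdown, YAML, JSON) for a single model on a single benchmark.
    """
    s = np.asarray(template_scores, dtype=float)
    mean = s.mean()
    return float(np.abs(s - mean).sum() / (len(s) * mean))

# Illustrative MMLU-style accuracies for one model under the four templates.
print(coefficient_of_mean_deviation([54.5, 50.0, 56.4, 59.7]))
```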
JSON
System:
{
"Persona": "You are a annotator working for large financial data company are tasked with extracting named entities from
financial documents who follows strict guidelines for quality and formatting.",
"Instructions": [
"You will be given a sentence from a financial document.",
"List the named entities in the order they appear in the sentence.",
"If an entity appears multiple times, list it multiples times.",
"Provide your chain of thought first and then respond with your final answer."
],
"OutputFormat": "Entities should be stated in the format NAME - TYPE where TYPE can be PERSON, ORGANIZATION, or LOCATION.
State your final answer as a comma-separated list of entities enclosed in square brackets. Example: [Bank - ORGANIZATION,
Borrower - PERSON]. If there are no entities found, state your final answer as ’No entities found’.",
"Example": "{ICL EXAMPLE INPUT}\n{ICL EXAMPLE SOLUTION}"
}
User:
{
"Task": "{INPUT}"
}
Figure 4: Model performance across prompt formats on MMLU, HumanEval and CODEXGLUE. The performance measure for MMLU is accuracy; for HumanEval it is pass@1, which checks whether the generated code passes the given unit tests in one attempt; for CODEXGLUE (Java2CS) it is BLEU score. Plots for the remaining datasets are in Figure 9.
Figure 5: Performance spread across models on the MMLU benchmark per domain. A wide performance spread is observed across domains that require different skills.
Figure 6: Coefficient of mean deviation (CMD) of scalar metrics for all the prompt templates. The figure shows the CMDs across models and datasets. The GPT-3.5 series exhibits larger CMD scores across benchmarks than the GPT-4 series, indicating higher sensitivity to the choice of format.

GPT-4 is trained on more data than GPT-3.5 and is clearly the overall more capable model (Achiam et al., 2023; Bubeck et al., 2023; Carlini et al., 2024). In this section, we aim to ascertain whether an expansion in general capability translates to enhanced stability in response to changes in templates. The CMDs for all the models across benchmarks are presented in Figure 6.

A lower value of CMD indicates more robustness to template variation. The results indicate that the GPT-4-1106-preview model exhibits superior robustness to format changes, maintaining a performance dispersion consistently below 0.036 across all benchmarks. In contrast, the GPT-4-32k-0613 model demonstrates less robustness relative to GPT-4-1106-preview, yet it outperforms the GPT-3.5 series, with CMDs not exceeding 0.043. The GPT-3.5 series displays a broader range of CMDs, from 0.035 to 0.176, signifying a higher degree of performance variability under different prompt formats. GPT-4’s observed improvements may be attributed to its enhanced ability to process data in diverse formats. Moreover, it is possible that the robustness of the model is not adversely impacted by format variations at the level of the last hidden layer of the prompt embedding. Notably, the GPT-4-1106-preview model achieves greater robustness compared to GPT-4-32k-0613, corroborating existing evidence that suggests the former has a heightened proficiency in comprehending and generating content in specific formats as instructed (OpenAI, November 2023). Further examining GPT-4-32k-0613’s performance, we notice that the CMD on the HumanEval benchmark is extremely high; this is due to the extremely low score obtained with the JSON format (see Table 4 for results). Analyzing the model outputs, we find the poor performance arises because most of the time the model would generate its chain of thought in plain text but did not continue with actually generating the code. The other models did not exhibit this behavior for the JSON template. We hypothesize that this may be related to OpenAI’s claim about fixing laziness in task completion in the 0125 version of GPT-4-turbo (OpenAI, 2024). In summary, larger models are more robust to template variation.

E Complete Results

E.1 Additional results on model performance under all templates across benchmarks

E.2 IoU scores on all benchmarks

E.3 Dotplots on all benchmark datasets
Format \ Model   GPT-35-turbo-0613   GPT-35-turbo-16k-0613   GPT-4-1106-preview   GPT-4-32k-0613
Plaintext        54.464 ± 18.300     54.184 ± 19.066         81.005 ± 12.979      80.638 ± 13.172
Markdown         50.021 ± 17.144     50.686 ± 17.436         81.252 ± 12.932      81.349 ± 13.158
YAML             56.355 ± 16.792     55.901 ± 16.347         80.758 ± 13.000      81.162 ± 13.110
JSON             59.705 ± 16.594     59.405 ± 17.092         73.918 ± 13.580      77.800 ± 13.725

Table 3: Model performance on the MMLU benchmark. Accuracy is averaged over 57 different subjects.
Table 4: Model performance on the HumanEval benchmark. We used all 164 samples for testing.

Table 5: Model performance on the NER Finance benchmark. We randomly sampled 500 samples for testing.

Table 6: Model performance on the FIND benchmark. The test set includes 500 functions.

Table 7: Model performance on HumanEval-X, a Java to Python translation task. The test set contains 164 data samples.

Table 8: Model performance on the Java to C# and C# to Java translation task. The test set contains 1000 samples in Java and C#.
(Figure panels: (a) MMLU, (b) FIND)

(Figure panels: (a) GPT-35-Turbo-0613, (b) GPT-35-turbo-16k-0613)

(Figure panels: (a) MMLU, (b) NER Finance, (g) FIND)