
Replacing Judges with Juries:

Evaluating LLM Generations with a Panel of Diverse Models


Pat Verga
Sebastian Hofstätter, Sophia Althammer, Yixuan Su
Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White
Patrick Lewis
Cohere

arXiv:2404.18796v1 [cs.CL] 29 Apr 2024

Abstract

As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe particular model properties difficult, but evaluating the correctness of a model's free-form generation alone is a challenge. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model like GPT-4. While this method has grown in popularity, it is costly, has been shown to introduce intra-model bias, and in this work, we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.

Figure 1: Top: Rankings of the evaluated models (R, R+, GPT3.5, GPT4, C3-Haiku, C3-Sonnet, Mistral-L) change drastically depending on which LLM is used as the judge on KILT-NQ. Bottom: Among the judges (Reference EM, R, Haiku, GPT3.5, PoLL), the Panel of LLm evaluators (PoLL) has the highest Cohen's κ correlation with human judgements.

1 Introduction

Evaluating generative language models is a challenging task: not only is it difficult to find meaningful data to test the models, but evaluating the correctness of a generated response is itself a challenge. Multiple choice datasets like MMLU (Hendrycks et al., 2020) have become popular in part by side-stepping the difficulty of evaluating generations. However, multiple-choice questions are in many ways probing a different property than that of a free-form generative task, which is oftentimes closer to the downstream use-case.

Many automatic metrics have been used across various tasks, such as BLEU in machine translation (Papineni et al., 2002), ROUGE for summarization (Lin, 2004), and heuristic string match methods, such as normalized exact match (EM) and token-level F1 for question answering (Rajpurkar et al., 2016). However, these simplistic methods commonly fail to analyze the intended property of interest. QA metrics, for example, invariably lead to both false positive failures (e.g. superfluous token overlap) and, more commonly, false negatives due to an incomplete set of gold reference answers (e.g. date format differences¹, inclusion of a middle initial in a person's name, etc.).

¹ We found that EM unjustly penalized Command models for a tendency to write in Canadian or British English, as QA dataset annotations typically format dates in American MM-DD-YYYY format.
More recent methods have attempted to address these issues by instead using trained or prompted models as evaluators (Sellam et al., 2020; Zheng et al., 2024; Li et al., 2024b; Kocmi and Federmann, 2023a; Shen et al., 2023). Prior work has shown that model-based scoring methods often correlate better with human judgements than heuristic metrics like EM (Bohnet et al., 2022; Zheng et al., 2024) and that strong evaluator models generalize well across different tasks (Huang et al., 2024).

Unfortunately, while the use of LLMs like GPT-4 as evaluators has become increasingly common, it has also been observed that evaluator models tend to have their own biases, often recognizing and preferring their own outputs over those of other models (Panickssery et al., 2024). Additionally, it is most common to use the largest, most universally capable models as evaluators, which is both slow and costly, limiting applicability and access.

In this paper, we perform experiments across three settings (single-hop QA, multi-hop QA, and Chatbot Arena), spanning six datasets, and make the following contributions:

1. We propose to evaluate LLM generations using a Panel of LLm evaluators (PoLL) drawn from different model families rather than a single large judge (Section 2).

2. We show that using an instantiation of PoLL correlates better with human judgements compared to a single large judge (GPT-4), while being over seven times cheaper (Sections 4.1 and 4.2).

3. In some scenarios, GPT-4 is a relatively weak judge, exhibiting high variance with minor changes to the prompt (Section 4.3).

4. Intra-model scoring bias is reduced by pooling judgements across a panel of heterogeneous evaluator models (Section 4.4).

2 Methods

2.1 Background: LLM as a Judge

A judge evaluator model J is used to score the output a from a test model A.

Single-point Scoring In some settings, J is tasked with rating the quality of a single model output independently of any point of comparison (e.g. Kocmi and Federmann, 2023b). J's prompt will often include natural language instructions on how the grading should be performed (i.e. what properties constitute a good or bad output). Aside from the prompt instructions, the rating is based solely on J's internal model of what a quality output is. Here, score = J(a).

Reference-based Scoring In other cases, the model is provided with some 'gold' reference r, which contains the information that should be included in a (e.g. Zhu et al., 2023). For example, in QA the reference would be the 'correct' answer to the question. In this case, score = J(a, r). This setting is explored in Sections 3.3 and 3.4.

Pair-wise Scoring Another very common setting is pair-wise scoring, where the goal is to choose which of two outputs is better (e.g. Zheng et al., 2024). Given outputs a and b generated by two models A and B, an evaluator J compares them and generates a preference score over the outputs as score = J(a, b)². The form of the score can vary based on the use-case, but it is often a three- or five-point scale such as a > b, a ≈ b, a < b. This setting is used in Section 3.5.

² To avoid ordering bias, it is common to test both the a−b and b−a orderings, as done in Section 3.5.

2.2 Panel of LLM Evaluators

The above settings assume that all scoring is performed by a single capable judge. However, as outlined earlier, one of the largest issues with relying on a single model J, such as GPT-4, is that it introduces intra-model bias. To address this, we instead propose to score answer correctness based not on a single judge, but on a panel composed of multiple evaluator models. Similar pooling techniques are used to reduce variance in human annotations by normalizing out both natural variation in human judgements caused by their own subjective biases and human error (Voorhees, 1998).

To calculate the PoLL score, each evaluator model independently scores a given model output just as it would in any of the scenarios outlined above. Those individual scores are then pooled together through a voting function³ such that the final score = f(j ∈ P : j(a)), where P is a panel composed of individual judges j and f is a voting function.

³ In this work we consider both max and average pool voting, though other functions could also work well.
3 Experimental Settings

3.1 PoLL Composition and Voting

In our experiments, we construct a PoLL from three models drawn from three disparate model families (Command R, Haiku, and GPT-3.5). We consider two different voting functions for aggregating scores across the judges. For QA datasets, we use max voting, as all judgements are binary [correct, incorrect]. For Chatbot Arena we instead use average pooling, because judgements are scores ranging from 1-5 and a three-judge panel often does not produce a clear majority decision⁴.

⁴ We also experimented with max voting with average-voting fallback, which yielded similar results with an identical overall ranking.

3.2 Model Families

Command R Family: Command R (CMD-R, 35B) and Command R+ (CMD-R+, 103B) are open weight models created by Cohere⁵. We consider Command R as one of the models in the PoLL.

GPT Family: GPT-3.5 and GPT-4 (OpenAI et al., 2024) are two of the most widely used models. GPT-4 is typically the default choice for models as judges. We use GPT-3.5 as a member of our PoLL.

Claude-3 Family: Models built by Anthropic include (in increasing size) Haiku, Sonnet, and Opus (Anthropic, 2024). We include Haiku in our PoLL.

Mistral Family: Mistral Large (Mistral-LG) and Mistral Medium (Mistral-MD) are proprietary models created by Mistral⁶. We did not conduct experiments with Mistral judges but evaluate their generations as a point of comparison, to have a model 'unaffiliated' with any judges.

⁵ Release blog of Command R: https://cohere.com/blog/command-r and Command R+: https://cohere.com/blog/command-r-plus-microsoft-azure
⁶ Release blog of Mistral Large: https://mistral.ai/news/mistral-large/ and Mistral Medium: https://mistral.ai/news/la-plateforme/

3.3 Single-hop Question Answering

Our question answering (QA) tasks are open-book settings where a model m is given a question q, must retrieve evidence e from some retrieval system (such as the internet or a dense index over Wikipedia), and must generate an answer g as g = m(q, e). Each instance is accompanied by a human-annotated reference answer.

We run experiments on the KILT (Petroni et al., 2021) versions of Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (TQA) (Joshi et al., 2017), and HotpotQA (HPQA) (Yang et al., 2018) (see Section A.1.1 for more details).

3.4 Multi-hop Question Answering

Multi-hop question answering is similar to the single-hop setting described in Section 3.3. However, the questions are designed such that models must perform multiple rounds of retrieval to answer sub-questions and collect sufficient evidence to ultimately answer the initial question. We perform experiments on two datasets: Bamboogle (Press et al., 2023) and HPQA. See Section A.1.2 for additional details.

3.5 Chatbot Arena Hard

Chatbot Arena⁷ is a popular benchmark for evaluating LLM head-to-head performance. This crowd-sourced effort has users prompt a pair of anonymized LLMs and rate which output is better (Zheng et al., 2024).

Recently, Li et al. (2024a) mined a 'hard' subset of 500 arena prompts and created an automatic evaluation where an LLM judge (GPT-4) would score a model output against a baseline generation (coming from a different version of GPT-4) in a head-to-head comparison. In their analysis, they determined that this evaluation correlated better with the human judgements from the original Chatbot Arena annotations when compared against other commonly used automated evaluation benchmarks.

At the time of writing, the repository only contained generated answers from GPT-3.5 and GPT-4. We therefore generated new outputs for all remaining models and subsequently scored the generated answers with the various judges using the original codebase from Li et al. (2024a)⁸. The raw scores can be found in Table 7. Following the original work, we treat the Chatbot Arena crowdsourced annotations as ground truth for calculating correlation between evaluator models and human judgements.

⁷ https://chat.lmsys.org/
⁸ https://github.com/lm-sys/arena-hard

3.6 Prompting Judges

As alluded to in Section 2.1, judge models need to be prompted in different ways depending on the particular task setup. Our QA experiments use reference-based scoring, and our judge prompts contain few-shot in-context examples of valid and invalid (q, a, r) triples. We also experiment with variations on these prompts in Section 4.3. Further details on judge prompts may be found in the appendix.
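The exact judge prompts are listed in the appendix tables; purely as an illustration of the reference-based, few-shot format described above, such a prompt could be assembled roughly as follows. The instruction line is abbreviated, the two example triples are borrowed from the appendix prompts, and the helper itself is a hypothetical sketch rather than the paper's implementation.

```python
# Hypothetical sketch of assembling a reference-based, few-shot judge prompt.
FEWSHOT = [
    # (question, provided answer, reference answer, label)
    ("when did the nfl start playing in london",
     "The NFL started playing regular season games in London in 2007.",
     "since 2007", "True"),
    ("Ed Oxenbould plays Tyler in a film directed by which famous director?",
     "Ed Oxenbould played Tyler in *The Visit*",
     "M. Night Shyamalan", "False"),
]

def build_judge_prompt(question: str, answer: str, reference: str) -> str:
    header = ('You will be given a Question and a Provided Answer. Judge whether the '
              'Provided Answer is correct by comparing it to the Reference Answer. '
              'If the Provided Answer is correct say exactly "True", otherwise say "False".\n\n')
    shots = "".join(
        f'Question: "{q}"\nProvided Answer: "{a}"\nReference Answer: "{r}"\nCorrect: {label}\n\n'
        for q, a, r, label in FEWSHOT
    )
    query = (f'Question: "{question}"\nProvided Answer: "{answer}"\n'
             f'Reference Answer: "{reference}"\nCorrect:')
    return header + shots + query
```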
As an additional point of reference in our QA experiments, we include the commonly used exact match (EM) metric judgements. We specifically use the 'containment' version of EM from prior work, which is more amenable to LLM long-form generation (Liu et al., 2024) and checks if a reference answer string appears within the generated model response (after normalization).
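The 'containment' EM check described above is simple enough to sketch directly; the normalization steps below (lowercasing, punctuation and article stripping) are a common choice and an assumption on our part rather than the exact rules of Liu et al. (2024).

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def containment_em(generation: str, references: list[str]) -> bool:
    """True if any normalized reference answer appears inside the normalized generation."""
    gen = normalize(generation)
    return any(normalize(ref) in gen for ref in references)
```

Because the reference only needs to appear somewhere in the long-form generation, this version avoids many of the false negatives of strict exact match, though it still misses legitimate variations such as reformatted dates.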
In our experiments on Chatbot Arena Hard, we follow the same pair-wise scoring setup as the original work. We do not modify the judge prompts in any way from the original implementation.

3.7 Human Judgements

To gather human reference judgements, we utilized Cohere's internal highly-qualified annotation workforce. Annotators were shown a single anonymized model-generated answer at a time, along with the original question and reference answer. Annotators were asked to judge whether the reference answer is semantically contained inside the generated answer⁹. See Section A.3 for additional details.

⁹ By asking for answer correctness alone, we hope to minimize any potential annotator bias or preference towards a particular model's style.

4 Results

4.1 Correlation to Human Judgements

Cohen's κ Correlation Cohen's kappa measures inter-rater reliability, which quantifies the level of agreement between two or more raters or judges. The kappa statistic takes into account the possibility of agreement occurring by chance, making it a more robust measure than simple percent agreement. The formula for Cohen's kappa is:

κ = (po − pe) / (1 − pe)

where po is the observed agreement proportion and pe is the expected agreement proportion by chance. This results in a value of κ ranging from -1 to 1. Opinions vary on how scores should be interpreted, but in general κ > 0.8 is considered a strong correlation and κ > 0.6 a moderate correlation.

In Table 1 we can see how the ratings from different evaluator judges, on different single-hop QA datasets from KILT, correlate with human judgements as measured by κ. We see that overall, PoLL has the strongest correlation across various tasks, while GPT-4 is one of the weaker evaluators on this particular task setup (see Section 4.3 for further analysis).

Judge     NQ     TQA    HPQA
EM        0.651  0.827  0.662
GPT-4     0.627  0.841  0.830
CMD-R     0.734  0.902  0.815
Haiku     0.749  0.894  0.873
GPT-3.5   0.726  0.859  0.833
PoLL      0.763  0.906  0.867

Table 1: Cohen's Kappa judge model performance on different single-hop QA datasets from KILT, where best results are indicated by bold font and second best results are underlined.

4.2 Rank Correlation on Chatbot Arena

In Table 2 we can see how different judge model rankings correlate with human judgement. Following Li et al. (2024a), we treat the crowd-sourced ELO rankings from Chatbot Arena as ground truth. We calculate both Kendall Tau (Kendall, 1938) and Pearson correlation (Pearson, 1895) of the ranked list produced by each of the judge methods with respect to this ground truth ranking. We find that PoLL is best correlated with the gold rankings, particularly at the top of the ranked list, as shown in Figure 2.

Judge/Correlation  Pearson  Kendall Tau
GPT-4              0.817    0.667
Haiku              0.883    0.722
GPT-3.5            0.883    0.730
CMD-R              0.817    0.676
PoLL               0.917    0.778

Table 2: Pearson and Kendall-Tau correlations between different judge models as compared to the rankings produced by the Chatbot Arena overall leaderboard.
Figure 2: Rankings of model performance on Chatbot Arena Hard judged by GPT-4 or PoLL. Ranks are compared to those in the original Chatbot Arena.

Figure 3: Accuracy changes of different evaluation judges as compared to human judgements on HotPotQA (multi-hop).

4.3 Judgement Variance by Prompt Changes

Based on the observation that GPT-4 was the weakest judge model on our KILT evaluations, we investigated how the model reacts to modifications to its prompt. GPT-4 is the most powerful judge model we tested, yet it performed worse than less capable models on what is essentially a fuzzy string-matching exercise. We hypothesize that this may be because GPT-4 is over-reasoning and injecting too much background knowledge into determining the correctness of an answer rather than simply aligning the gold reference with the generation. In Table 3, we can see how the correlation between GPT-4 and human annotators varies as the prompt changes. In all cases, having in-context examples improves performance over zero-shot, and the most effective strategy is an explicit instruction to the model not to 'overthink' and not to concern itself with the wider factuality of the answers with respect to the outside world.¹⁰ These changes bring the agreement level for GPT-4 up to the level of GPT-3.5 when using our few-shot standard prompt, though still below Command R, Haiku, and PoLL.

¹⁰ The 'don't overthink' prompt had a negligible positive effect on Haiku, a minor negative effect on Command R, and a severe negative impact on GPT-3.5.

Prompt Variant           Kappa
Zero-shot                0.518
Few-Shot Standard        0.627
+No Instruction Line     0.594
+Move Instruction Line   0.637
+Chat Formatted Shots    0.561
+'don't overthink'       0.725

Table 3: Kappa values on NQ for different prompt variants with GPT-4 as judge. For reference, GPT-3.5 with the few-shot standard prompt achieves 0.726 (Table 1).

Figure 4: Accuracy changes of different evaluation judges as compared to human judgements on Bamboogle.

4.4 Judge Bias and Consistency

One of the biggest motivators for replacing a single large judge with a panel of heterogeneous models is to reduce bias in evaluation. To analyze to what extent this was the case, we compared the delta in absolute accuracy score for our individual judges and PoLL relative to scores by human annotators across our multi-hop datasets. Figures 3 and 4 show results on HotPotQA and Bamboogle. We can see how the different judges score different models and how far those predictions deviate from human annotator decisions (the dotted line at 0).

We observe that overall, PoLL has the smallest spread in scores, with a standard deviation of 2.2, compared to EM and individual judges. GPT-3.5 has the highest spread, with a standard deviation of 6.1. We also see in Figure 4 that the highest positive delta for each individual model being scored occurs when it is judged by itself.

In Figure 2, we assessed the model rankings produced on Chatbot Arena Hard by PoLL as compared to the GPT-4 judge from the original work.
The 'gold' ranking appears on the diagonal and represents the rankings from the original Chatbot Arena ELO. We find that PoLL rankings correlate better with the ground truth, particularly at the top of the ranked list. We can clearly observe intra-model bias, as the GPT-4 judge ranks another GPT-4 variant in position 2, higher than its actual position 4, which is in line with previous works that have also observed GPT-4's preference for its own generations (Zheng et al., 2024; Panickssery et al., 2024).

4.5 Cost and Latency

At the time of writing, the cost of running our specific instance of PoLL is $1.25/input¹¹ + $4.25/output, whereas the cost of running GPT-4 Turbo is $10/input + $30/output. Depending on the ratio of input-to-output tokens in a given task, running the entire three-model PoLL is seven to eight times less expensive than running a single GPT-4 judge.

¹¹ Per million tokens.

We did not run explicit latency evaluations, and many factors such as model choice, serving platform, and more can impact speed. In general, though, running a collection of smaller models in parallel (as in PoLL) is faster than a single big model.
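As a quick sanity check of the 'seven to eight times' figure, the blended cost ratio can be recomputed from the per-million-token prices quoted above for a few input-to-output mixes; this is a back-of-the-envelope sketch, not an official pricing calculator.

```python
# Per-million-token prices quoted in Section 4.5.
POLL_IN, POLL_OUT = 1.25, 4.25   # combined three-judge PoLL
GPT4_IN, GPT4_OUT = 10.0, 30.0   # GPT-4 Turbo

def cost_ratio(input_tokens: float, output_tokens: float) -> float:
    """How many times cheaper the panel is for a given token mix."""
    poll = input_tokens * POLL_IN + output_tokens * POLL_OUT
    gpt4 = input_tokens * GPT4_IN + output_tokens * GPT4_OUT
    return gpt4 / poll

for mix in [(1.0, 0.01), (1.0, 1.0), (0.01, 1.0)]:   # input-heavy, balanced, output-heavy
    print(mix, round(cost_ratio(*mix), 2))            # roughly 8.0, 7.3, and 7.1
```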
5 Conclusions and Limitations

In this paper, we showed how a Panel of LLM Evaluators composed of smaller models is not only an effective method for evaluating LLM performance, but also reduces intra-model bias, latency, and cost. The benefits of PoLL are bolstered by the finding that there is not a single 'best' judge across all settings, while PoLL performs well consistently.

In this work we investigated only three evaluator settings and a limited number of judges and panel compositions. While we showed that PoLL is an effective alternative to a single large model in these settings, further work is needed to see how broadly applicable the method is, for example in math or reasoning evaluations, where language models often struggle (Zheng et al., 2024). We also leave the task of 'panel selection', or identifying the best models to include in PoLL in terms of quality and cost, to future work.

Acknowledgement

We'd like to thank all of the team at Cohere and in particular Maxime Voisin, Victoria Spooner, Priyanka Sen, Harry Moynehan, Erik Arakelyan, Antoine Debugne, Alex Barbet, Max Bartolo, Acyr Locatelli, Sara Hooker, and Phil Blunsom. Thanks also to Emma Strubell.

References

Anthropic. 2024. The Claude 3 model family: Opus, Sonnet, Haiku.

Bernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Massimiliano Ciaramita, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, et al. 2022. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. In International Conference on Learning Representations.

Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge models are task-specific classifiers. arXiv preprint arXiv:2403.02839.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Yu Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. ArXiv, abs/2004.04906.

Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Tom Kocmi and Christian Federmann. 2023a. Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193–203, Tampere, Finland. European Association for Machine Translation.

Tom Kocmi and Christian Federmann. 2023b. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. 2024a. From live data to high-quality benchmarks: The Arena-Hard pipeline.

Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, and Chongyang Tao. 2024b. Leveraging large language models for NLG evaluation: A survey. arXiv e-prints, pages arXiv–2401.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, et al. 2024. GPT-4 technical report.

Arjun Panickssery, Samuel R. Bowman, and Shi Feng. 2024. LLM evaluators recognize and favor their own generations. arXiv e-prints, pages arXiv–2401.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Karl Pearson. 1895. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352):240–242.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2021. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, and Lidong Bing. 2023. Large language models are not yet human-level evaluators for abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4215–4233, Singapore. Association for Computational Linguistics.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.

Ellen M. Voorhees. 1998. Variations in relevance judgments and the measurement of retrieval effectiveness. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 315–323, New York, NY, USA. Association for Computing Machinery.

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Improving text embeddings with large language models. ArXiv, abs/2401.00368.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.

Lianghui Zhu, Xinggang Wang, and Xinlong Wang. 2023. JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.
A Appendix

A.1 Additional Experimental Details

A.1.1 Single-hop QA

Datasets KILT (Petroni et al., 2021) is a popular benchmark that measures a model's open-domain question-answering capabilities with 1) a retrieval step to retrieve relevant documents from a knowledge source and 2) an answer generation step taking into account the relevant documents from (1). We select three popular test sets from the KILT benchmark for evaluation: Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (TQA) (Joshi et al., 2017), and HotpotQA (HPQA) (Yang et al., 2018). The Natural Questions test set consists of real user questions sampled from a popular web search engine. TriviaQA is a test set with complex questions written by trivia enthusiasts. HotpotQA is a multi-hop question answering test set where questions require finding and reasoning over multiple supporting documents to answer. For all three test sets, the knowledge source is the corpus of Wikipedia documents from the KILT benchmark (Petroni et al., 2021).

KILT Experimental Details: For KILT, we use the DPR-formatted, 100-word chunked KILT Wikipedia dump (Karpukhin et al., 2020; Petroni et al., 2021).¹² Then, for all questions, we retrieve and cache the top-10 snippets from the dump using the GTE-large dense retrieval model (Wang et al., 2023), ready to be used to generate answers from different models.

¹² The pre-chunked dump can be downloaded here: https://github.com/facebookresearch/DPR/issues/186#issuecomment-923268677

Generating answers for judgement: To generate answers from all considered models, we ask the question as an unmodified chat message and pass in the snippets using the dedicated documents parameter of the model family's API where available (Command-R Family)¹³. For model families that do not have a specific documents API parameter (GPT), we instead adopt the question answering prompt template used in Liu et al. (2024), which is shown in Table 9.

¹³ The prompt template that gets used when using this parameter is described at https://huggingface.co/CohereForAI/c4ai-command-r-v01 with additional details at https://docs.cohere.com/docs/prompting-command-r

Judge Prompts We use the same prompt-template structure for KILT judgements as for multihop judgements. We initially observed relatively lower levels of human agreement when applying the multihop prompt verbatim to the KILT experiments. We hypothesise this was due to a mismatch in model-generated answer styles in the few-shot examples. We therefore create new judgement prompts with the following procedure for each of the KILT datasets, randomly sampling a set of labelled few-shot examples from our pool of human-labelled examples according to the following constraints:

• at least 1 annotated positive example from each model family's generations

• at least 1 annotated negative example from each model family's generations

• at least 1 example where the human disagrees with the keyword-based exact match score.

We sweep over n shots in {8, 10, 16}, picking the prompt that has the strongest agreement on average across all judges for a small held-out portion of human labels. The resulting judgement prompts used for KILT in our experiments in the main paper are listed in Tables 12, 11 and 13.

GPT-4 judge prompt ablation Here we include some additional details on the prompt ablation for KILT Natural Questions mentioned in Section 4.3:

• Zero-shot: the natural language instruction is supplied as a system call. We then directly ask the model to judge with no labelled examples.

• Few-shot Standard: This is the prompt in Table 11, which is used for all other model judges.

• No Instruction Line: Here we remove the natural language instruction from the few-shot standard prompt, on the hypothesis that the instruction is confusing the model. This hypothesis turns out to be false, as agreement actually drops (-0.03 ∆κ).

• Move Instruction Line: Here we modify the standard few-shot prompt by moving the instruction line into a separate system call. This results in a small improvement (+0.01 ∆κ).
• Chat-Formatted Shots: Here we modify the standard few-shot prompt by formatting each few-shot example as a conversational turn between the user and the assistant. This reduces performance (-0.07 ∆κ).

• 'don't overthink': Here, we replace the instruction line of the standard prompt ("You will be given a Question and a Provided Answer. Judge whether the Provided Answer is correct by comparing it to the Reference Answer. Differently formatted dates, people with missing middle names, and alternative spellings should all be considered the same. If the Provided Answer is correct say exactly "True", otherwise say "False".") with a wording which is intended to encourage the model to perform a simpler function and not incorporate external knowledge: "You are judging whether a model has generated a correct answer to a question. Study the examples the user gives you as they will be very informative for how to do the task. The Reference Answers you get will be short. A model's answer will be longer, and can be considered correct if it contains the semantic content of the short reference answer somewhere within it. Don't worry about factuality with respect to the real world, just judge the example based on what you see. No need to overthink this task, it really comes down to just soft matching." This improves results for GPT-4 by +0.07 ∆κ. Additional small surface-level changes and moving the instruction to a system call lead to an additional +0.03 ∆κ. The final optimized prompt for GPT-4 can be found in Table 14.

The agreements for different models using the prompt optimized for GPT-4 can be found in Table 4.

Judge     Standard  'don't overthink'
EM        0.651     0.651
GPT-4     0.627     0.725
CMD-R     0.734     0.687
Haiku     0.749     0.757
GPT-3.5   0.729     0.509
PoLL      0.763     0.699

Table 4: Kappa values for KILT NQ for the standard prompt and the prompt optimized for GPT-4 (note that the EM baseline is not LLM-dependent and therefore not dependent on prompt).

A.1.2 Multihop

Bamboogle (Press et al., 2023) is 125 questions adversarially constructed to require multiple internet searches to successfully answer. HotpotQA asks questions meant to be answered with an accompanying Wikipedia dump (we consider the subset used in (Shinn et al., 2024)).

Generating answers for judgement – ReAct Agents: All models follow the basic ReAct setup for tool-use agents, which incorporates a Thought-Action-Observation loop to iteratively call tools and reason over their outputs to find a final answer (Yao et al., 2022). We use LangChain tool-use implementations for each model when available. On Bamboogle, agents have access to an internet search tool, and on HotPotQA, agents have access to a search tool over Wikipedia that does dense embedding search using Cohere EmbedV3 + Rerank 3. All search tool calls return the top two most relevant documents to the model.

Judge Prompt The prompt used for judging multihop answers can be found in Table 10.

A.2 Additional Results

A.2.1 Multi-hop QA

In Tables 5 and 6 we can see κ correlations on Bamboogle and HotPotQA multi-hop question answering.

            EM     GPT-3.5  Cmd-R  Haiku  PoLL
Haiku       0.746  0.772    0.741  0.802  0.803
Sonnet      0.767  0.879    0.880  0.877  0.896
CMD-R       0.599  0.881    0.896  0.897  0.938
Mistral-LG  0.841  0.760    0.936  0.936  0.872
GPT-3.5     0.849  0.886    0.935  0.935  0.935
GPT-4       0.677  0.862    0.877  0.877  0.903
Total       0.762  0.846    0.885  0.898  0.896

Table 5: Kappa values for various chatbot models on Bamboogle.

            EM     GPT-3.5  Cmd-R  Haiku  PoLL
Haiku       0.703  0.910    0.890  0.864  0.889
Sonnet      0.715  0.796    0.847  0.844  0.820
CMD-R       0.780  0.815    0.909  0.906  0.859
Mistral-LG  0.788  0.754    0.895  0.894  0.853
GPT-3.5     0.799  0.757    0.879  0.879  0.858
GPT-4       0.768  0.890    0.835  0.835  0.890
Total       0.773  0.827    0.886  0.849  0.871

Table 6: Kappa values for various chatbot models on multi-hop HotPotQA.

A.2.2 Arena Hard Scores

Raw scores for Arena Hard coming from PoLL are shown in Table 7.

Model Score 95% CI


GPT-4-turbo 68.7 (-2.1, 2.2)
Sonnet 57.6 (-3.2, 2.2)
CMD-R+ 57.1 (-2.9, 2.5)
Haiku 55.9 (-2.1, 2.2)
GPT-4-0314 50.0 (0.0, 0.0)
Mistral-MD 49.0 (-2.3, 2.7)
CMD-R 48.0 (-1.9, 2.7)
Mistral-LG 43.7 (-1.9, 2.2)
GPT-3.5 41.2 (-2.1, 2.5)

Table 7: Scores from Arena Hard as scored by the PoLL using average pooling.

A.3 Human Annotations

Annotations were performed by professional annotators with diverse demographic and professional backgrounds, including novelists, copywriters, copy editors, and journalists. Single-hop TriviaQA, Bamboogle, and multi-hop HotPotQA were singly annotated, while Natural Questions and single-hop HotPotQA were triple annotated. Experiments on triple-annotated datasets used a pooled super-annotator decision derived by majority voting. The total number of annotations is shown in Table 8.
Table 8: Total Number of Human Annotations by Dataset

Dataset Annotated Examples Annotations per Example


Natural Questions 688 Triple
TriviaQA 316 Single
HotPotQA (single) 784 Triple
HotPotQA (multi) 595 Single
Bamboogle 750 Single

Write a high-quality answer for the given question using only the provided search results (some of
which might be irrelevant).

Document [1](Title: Lake Eyre basin) keeping pace with evaporation. In contrast, the flow of the
Mississippi could fill Lake Eyre in 22 days, that of the Amazon in just 3 days. Other lakes in the
basin include Lake Frome, Lake Yamma Yamma and Lake Hart. Geography Rivers. The Cooper Creek, Finke
River, Georgina River and Diamantina River are the four main rivers of the basin. Other desert rivers
include the Hale River, Plenty River and Todd River that flow from the south east of the Northern
Territory, south. In the
Document [2](Title: Lake Eyre basin) make it as far south as Lake Eyre, although the story is told
that this happened once early in the 20th century. In extreme events, water from the Finke River flows
into the Macumba River, which empties into Lake Eyre, a total distance from headwater streams of
around . Major tributaries include Ellery Creek, and the Palmer and Hugh Rivers. The Georgina River
system originates on the Barkly Tableland, near the Northern Territory-Queensland border, north-west
of Mount Isa and not far south of the Gulf of

...

Document [10](Title: Lake Eyre) from the north-east part of the Lake Eyre Basin|in outback
(south-west and central) Queensland|flow towards the lake through the Channel Country. The amount of
water from the monsoon determines whether water will reach the lake and, if it does, how deep the lake
will get. The average rainfall in the area of the lake is per year. The altitude usually attributed to
Kati Thanda–Lake Eyre refers to the deepest parts of the lake floor, in Belt Bay and the Madigan Gulf

Question: where does the water come from to fill lake eyre
Answer:

Table 9: Example of a single-hop question answering prompt from Liu et al. (2024) used for KILT answer generations
where the LLM API does not have a documents parameter.
You will be given a Question and a Provided Answer. Judge whether the Provided Answer is correct by comparing it to the Reference Answer. Differently
formatted dates, people with missing middle names, and alternative spellings should all be considered the same. If the Provided Answer is correct say
exactly "True", otherwise say "False".

Question 1: "When did the president who set the precedent of a two term limit leave office?"
Provided Answer: "George Washington set the precedent of a two-term limit when he decided not to seek a third term in 1796. He left office in 4 March,
1797."
Reference Answer: "March 4, 1797"
Correct: True

Question 2: "Where does Śivarāma Swami conduct courses on Vaishnava Theology?"


Provided Answer: "Śivarāma Swami conducts courses on Vaishnava Theology at Bhaktivedanta Manor."
Reference Answer: "Where does Śivarāma Swami conduct courses on Vaishnava Theology?"
Correct: False

Question 3: "The most populous city in Punjab is how large (area wise)?"
Provided Answer: "Ludhiana, the most populous city in Punjab, covers an area of 3,767 square kilometres according to the website of Punjab’s Department
of Agriculture & Farmer Welfare. Another source states it is 310 square kilometres, making it the most densely populated urban centre in the state."
Reference Answer: "310 square kilometers"
Correct: True

Question 4: "Who was mayor of New York City when Fiorello H. La Guardia was born?" Provided Answer: "Frank Edson was mayor of New York City when Fiorello
H. La Guardia was born on December 11, 1882. Edson served as mayor from 1881 to 1882, and La Guardia himself later became the 99th Mayor of New York
City, serving from 1934 to 1946 or 1945."
Reference Answer: "William Russell Grace"
Correct: False

Question 5: "What is the third letter of the top level domain of the military?"
Provided Answer: "The third letter of the top-level domain for the military, ’.mil’, is ’l’.
The domain name ’mil’ is derived from the word ’military’. It is the only top-level domain for a country’s military, a legacy of the US military’s role
in the creation of the internet."
Reference Answer: "l"
Correct: True

Question 6: "Ed Oxenbould plays Tyler in a film directed by which famous director?"
Provided Answer: "Ed Oxenbould played Tyler in *The Visit*"
Reference Answer: "M. Night Shyamalan"
Correct: False

Question 7: "Alice David is the voice of Lara Croft in a video game developed by which company ?"
Provided Answer: "Alice David is the voice of Lara Croft in the French dub of the video game *Tomb Raider*."
Reference Answer: "Crystal Dynamics"
Correct: False

Question 8: "{QUESTION}"
Provided Answer: "{GEN ANSWER}"
Reference Answer: "{GOLD ANSWER}"
Correct:

Table 10: Multihop Judge Prompt


You will be given a Question and a Provided Answer. Judge whether the Provided Answer is correct by comparing it to the Reference Answer. Differently formatted dates, people with missing middle
names, and alternative spellings should all be considered the same. If the Provided Answer is correct say exactly "True", otherwise say "False".

Question 1: "when did the nfl start playing in london"


Provided Answer: "According to the provided search results, the NFL started playing regular season games in London as part of the NFL International Series in 2007. Specifically:

Document [5] states: "The NFL International Series was inaugurated in 2007 to host NFL regular season games outside the United States. Played at the new Wembley Stadium in London (rebuilt and
reopened in 2007), the series increased from one to two games for the 2013 season, to three games for the 2014 season, and then to four games from the 2017 season."

Document [9] also mentions: "Since 2007, the league has held multiple regular season games in London each season as part of NFL London Games, allowing the league to test solutions to some of the
challenges facing a hypothetical London franchise."

So 2007 marked the start of the NFL playing regular season games at the newly rebuilt Wembley Stadium in London as part of the International Series. Prior to that, there were some preseason games
played in London, but 2007 was the first year for regular season NFL games in the city."
Reference Answer: "since 2007"
Correct: True

Question 2: "where is wind power used most in the united states"


Provided Answer: "Iowa is the state that generates the highest percentage of energy from wind power, with more than 35% of its electric power coming from wind. Alternatively, North Dakota has the
highest per capita wind generation.

The state with the most installed wind power capacity is Texas, followed by Iowa. Texas’s capacity is around 20,000 MW, almost double that of Iowa’s 3,670 MW. The Roscoe Wind Farm in Texas is also
the world’s largest wind farm at 780 MW.

California has doubled its wind power capacity since 2002 and now gets about 5% of its energy from wind power. This is still lower than the twelve U.S. states that get more than 20% of their energy
from wind."
Reference Answer: "Texas"
Correct: True

Question 3: "who is the owner of phoenix mall in chennai"


Provided Answer: "The owner of Phoenix Marketcity in Chennai is Phoenix Mills Limited, as mentioned in the document titled "Phoenix Marketcity (Chennai)." This mall was jointly developed by Phoenix
Mills Limited and Crest Ventures Ltd."
Reference Answer: "Phoenix Mills Limited"
Correct: True

Question 4: "number of degree of freedom for plane mechanism"


Provided Answer: "In mechanics, the degree of freedom (DOF) of a mechanical system is the number of independent parameters that define its configuration. The number of degrees of freedom for a
mechanism on a two-dimensional plane is three, which consists of two translations and one rotation.

For a system of multiple bodies, the combined DOF is the sum of the DOFs of the individual bodies, minus the internal constraints on their relative motion. Therefore, a mechanism formed of multiple
rigid bodies may have more than three DOFs."
Reference Answer: "two"
Correct: False

Question 5: "who had the best strike rate among batsmen who scored over 200 runs in the world cup"
Provided Answer: "The provided documents do not specify the strike rate of all batsmen who scored over 200 runs in the World Cup. However, among the batsmen mentioned who have scored over 200 runs in
the World Cup, Jehan Mubarak has the highest ever strike rate in a T20 World Cup match for Sri Lanka, with a strike rate of 353.84 (Document [6]). But it’s important to note that this strike rate is
not necessarily reflective of his overall performance in the World Cup, as it could have been achieved in a single match.
For a more accurate comparison of strike rates among batsmen who scored over 200 runs in the World Cup, a comprehensive list of all such batsmen and their respective strike rates would be required."
Reference Answer: "Rohit Sharma"
Correct: False

Question 6: "who played ben stone son on law and order"


Provided Answer: "According to the provided search results, Ben Stone’s son Peter Stone was played by Philip Winchester. Some key details:
- Peter Stone is the son of Benjamin "Ben" Stone, the former Executive Assistant District Attorney portrayed by Michael Moriarty on Law & Order.
- Peter first appeared as a character on the Chicago P.D. episode "Justice", which served as a backdoor pilot for the short-lived spin-off Chicago Justice, where he was the lead character played by
Philip Winchester.
- After Chicago Justice was canceled, Philip Winchester reprised his role as Peter Stone in Season 19 of Law & Order: Special Victims Unit, where he became the Assistant District Attorney for the Sex
Crimes Bureau.
- In several SVU episodes, Peter describes having a strained relationship with his workaholic father Ben Stone, who neglected his family in favor of his prosecutorial duties.
So in summary, Philip Winchester played the role of Peter Stone, the son of the iconic Law & Order character Ben Stone, across the shows Chicago Justice and Law & Order: SVU."
Reference Answer: "Michael Moriarty"
Correct: False

Question 7: "who won so you think you can dance season 13"
Provided Answer: "I’m sorry, I couldn’t find any information about the winner of season 13 of So You Think You Can Dance." Reference Answer: "Leon ‘‘ Kida ’’ Burns"
Correct: False

Question 8: "when did sweet caroline start at red sox games"


Provided Answer: "The song "Sweet Caroline" started being played at Fenway Park, home of the Boston Red Sox, since at least 1997. However, it has been played consistently in the middle of the eighth
inning at every game since 2002."
Reference Answer: "at least 1997"
Correct: True

Question 9: "when did end of the road come out"


Provided Answer: "I found multiple results for "End of the Road":
## End of the Road (1944 film)
End of the Road is a 1944 American crime film directed by George Blair and written by Denison Clift, Gertrude Walker and Albert Beich. The film stars Edward Norris, John Abbott, June Storey, Jonathan
Hale, Pierre Watkin and Ted Hecht.
## End of the Road (1970 film)
End of the Road is a 1970 American comedy-drama film directed, co-written, and edited by Aram Avakian and adapted from a 1958 novel by John Barth. The film stars Stacy Keach, James Earl Jones and
Harris Yulin.
## End of the Road (Boyz II Men song)
End of the Road is a single recorded by American R&B group Boyz II Men for the "Boomerang" soundtrack. It was released on June 30, 1992.

## End of the Road (Jerry Lee Lewis song)


End of the Road is a song by Jerry Lee Lewis. It appeared on the 1976 Sun Records compilation album "The Original Jerry Lee Lewis" which was released by the Sun Record Company.
Which of these were you referring to?"
Reference Answer: "June 30 , 1992"
Correct: True

Question 10: "who played john connor in the original terminator"


Provided Answer: "John Connor is not physically portrayed by any actor in the original "Terminator" film released in 1984, as he is only mentioned and not seen. The character of John Connor is first
referred to in this film as the future leader of the human resistance against the machines, but he does not make an appearance."
Reference Answer: "Edward Furlong throughout the remainder of the film"
Correct: False

Question 11: "{QUESTION}"


Provided Answer: "{GEN ANSWER}"
Reference Answer: "{GOLD ANSWER}"
Correct:

Table 11: KILT NaturalQuestions judge prompt.


You will be given a Question and a Provided Answer. Judge whether the Provided Answer is correct by comparing it to the Reference Answer. Differently formatted dates, people with missing middle names, and alternative spellings
should all be considered the same. If the Provided Answer is correct say exactly "True", otherwise say "False".

Question 1: "Who was also an actor, Serri or John Fogerty?"


Provided Answer: "Based on the provided search results, there is no information indicating that John Fogerty has pursued acting as a profession or has been involved in acting projects. The search results focus on his music career,
including his time with Creedence Clearwater Revival, his solo career, and various performances and recognitions he has received over the years. Therefore, without information on Serri in the provided documents, it cannot be
determined if Serri was an actor. However, it is clear from the search results that John Fogerty is primarily known as a musician, singer, and songwriter, not as an actor."
Reference Answer: "Serri"
Correct: False

Question 2: "The 37th Disney animated feature film Tarzan features You’ll be in my heart by which artist?" Provided Answer: "The 37th Disney animated feature film Tarzan’s soundtrack, Tarzan: An Original Walt Disney Records,
includes the song You’ll Be in My Heart written and performed by Phil Collins. The song won the Academy Award for Best Original Song and the Golden Globe Award for Best Original Song. Collins was originally hired as a songwriter
but ended up performing the song as well."
Reference Answer: "Phil Collins"
Correct: True

Question 3: "Who is the current drummer of the band who did the song "What Lovers Do"?"
Provided Answer: "Unfortunately, none of the provided search results directly mention the current drummer for the band Maroon 5 who performed the song "What Lovers Do". The results provide background information on the song
itself, its music video, chart performance, and other details, but do not specify the band members or drummer.

Maroon 5 is a well-known pop rock band fronted by Adam Levine, but their full current lineup and specific drummer for the "What Lovers Do" era is not stated in these search results. To definitively answer who the drummer was,
additional information directly about Maroon 5's band members around 2017 when "What Lovers Do" was released would be needed."
Reference Answer: "Matt Flynn"
Correct: False

Question 4: "Who is older, Ferdi Taygan or Mahesh Bhupathi?"


Provided Answer: "Unfortunately, none of the provided search results contain any information about Ferdi Taygan or their age to compare with Mahesh Bhupathi’s age. The search results are all focused on providing biographical
details about Mahesh Bhupathi, an Indian professional tennis player, and some other individuals named Mahesh Bhatt who appear to be unrelated to the question. Without any information about Ferdi Taygan, it is impossible to
determine who is older between them and Mahesh Bhupathi based solely on these search results."
Reference Answer: "Ferdi Taygan"
Correct: False

Question 5: "What is the birthday of the actress who was the Duchess in "The Revengers Tragedy"?"
Provided Answer: "Diana Quick, who played the Duchess in "The Revengers Tragedy", was born on 23rd September 1934."
Reference Answer: "23 November 1946"
Correct: False

Question 6: "Rejuvelac is kind of grain water invented and promoted by a "holistic health" practitioner born in which year ?"
Provided Answer: "Rejuvelac is a kind of grain water invented and promoted by Ann Wigmore, who was born in 1909."
Reference Answer: "1909"
Correct: True

Question 7: "Who is the current drummer of the band who did the song "What Lovers Do"?"
Provided Answer: "The song "What Lovers Do" is by the American pop rock band Maroon 5. The current drummer of Maroon 5 is Matt Flynn."
Reference Answer: "Matt Flynn"
Correct: True

Question 8: "The 37th Disney animated feature film Tarzan features You’ll be in my heart by which artist?"
Provided Answer: "The song "Youĺl Be in My Heart" featured in the 1999 Disney animated film Tarzan was written and performed by Phil Collins.
The key details that support this answer from the provided search results:
1. Document [2] states: "You’ll Be in My Heart" is a song by Phil Collins, from the 1999 Disney animated feature "Tarzan"."

2. Document [3] mentions: "For his contribution to the soundtrack, Collins received an American Music Award for Favorite Adult Contemporary Artist."

3. Document [5] says: "Collins was originally hired by Disney as a songwriter, but was chosen because he was a drummer at first and given his history as a drummer for the rock band Genesis, he appealed to the Disney production
crew, who "wanted a strong jungle beat to accompany Tarzanś adventures". This "ballad" is one of five original songs he wrote for the film, and he ended up also performing this song as well."

4. Document [7] states: "The song is about "how love is a bond that cannot be broken"."

So in summary, the artist who wrote and performed the song "You’ll Be in My Heart" featured in the Disney animated film Tarzan is Phil Collins."
Reference Answer: "Phil Collins"
Correct: True

Question 9: "How long is the river for which Frenchmans Creek is a tributary?"
Provided Answer: "The length of the river for which Frenchmans Creek is a tributary cannot be directly determined from the provided documents as they only provide information about the Frenchmans Creek itself and not the river it
flows into. However, we can see that Frenchmans Creek is a tributary of the Darling River in New South Wales, Australia (Document 2), the Mississippi River in Missouri, USA (Document 3), and the Milk River in Saskatchewan, Canada,
and Montana, USA (Document 5). The length of these rivers is not provided in the documents."
Reference Answer: "2844 km"
Correct: False

Question 10: "Conrad Anker located the body of a mountaineer who took part it the first three British expeditions to which mountain?"
Provided Answer: "Conrad Anker located the body of George Mallory on Mount Everest. Mallory was a British climber who took part in the first three British expeditions to Mount Everest, including the 1924 expedition during which
he and his climbing partner Andrew Irvine disappeared while attempting to reach the summit. Anker found Mallory’s body in 1999 as a member of a search team looking for the remains of the British climber.
References: [1] Conrad Anker [2] 2007 Altitude Everest expedition [9] Timeline of Mount Everest expeditions"
Reference Answer: "Mount Everest"
Correct: True

Question 11: "The city that contains the Yunnan Provincial Museum is also known by what nickname?"
Provided Answer: "The city that contains the Yunnan Provincial Museum, Kunming, is also known by the nickname "Spring City.""
Reference Answer: "Spring city"
Correct: True

Question 12: "The 1st Virginia Regiment saw action at a battle that led to the departure of what Royal Governor?" Provided Answer: "Based on the provided search results, the 1st Virginia Regiment saw action at the Battle of Great
Bridge in December 1775, which led to the departure of Lord Dunmore, the Royal Governor of Virginia at the time.

Document [1] mentions that the 1st Virginia Regiment participated in the Battle of Great Bridge. Document [4] states that after the Battle of Great Bridge, "Dunmore responded by bombarding Norfolk with his ships on January 1,
1776." This implies that the Battle of Great Bridge, in which the 1st Virginia Regiment fought, led to Lord Dunmoreś departure from Virginia as the Royal Governor.

So the 1st Virginia Regiment’s involvement in the Battle of Great Bridge resulted in the departure of Lord Dunmore, the Royal Governor of the Colony of Virginia during the American Revolutionary War."
Reference Answer: "Royal Governor Lord Dunmore"
Correct: True
Question 13: "The voice of Homer Simpson also voices what alien character in the television series "The Simpsons"?"
Provided Answer: "Dan Castellaneta voices Homer Simpson, Grampa Simpson, Krusty the Clown, Groundskeeper Willie, Mayor Quimby, Barney Gumble, and other adult characters."
Reference Answer: "Kodos"
Correct: False

Question 14: "Who were the producers of the show for which Chris McKay made his feature film directorial debut?" Provided Answer: "Chris McKay made his feature film directorial debut with "The Lego Batman Movie" in 2017. The
producers of "The Lego Batman Movie" were not explicitly mentioned in the provided documents. However, it is noted that Seth Grahame-Smith co-wrote the screenplay for the movie, and Warner Bros. was involved in its development,
as mentioned in Document 8. Additionally, Document 9 mentions that McKay had signed a first look deal for being a producer with Warner Bros. around the time of the movieś development. While this information does not directly
answer the question regarding the specific producers of "The Lego Batman Movie," it does highlight the key individuals and entities involved in its creation and development." Reference Answer: "Dan Lin, Roy Lee, Phil Lord and
Christopher Miller" Correct: False

Question 15: "Marion Greene was a health policy analyst for St. Judt Medical company, which had how many principal operations worldwide?"
Provided Answer: "The provided documents do not specify the number of principal operations St. Jude Medical company had worldwide. Document 1 mentions that Marion Greene was a health policy analyst for St. Jude Medical, but it
does not provide details about the company’s operations. Document 2 also mentions Greene’s work for St. Jude Medical, but again, no information is given about the company’s operations. The other documents do not pertain to St.
Jude Medical. Therefore, I cannot provide an answer to this question based on the given documents."
Reference Answer: "20"
Correct: False
Question 16: "What is the name of the actress who has received a Drama Desk award and was part of the cast album for the 2007 Broadway production of the hit musical, "Grease"?"
Provided Answer: "Laura Osnes, who played the role of Sandy Dumbrowski in the 2007 Broadway production of "Grease", is a Drama Desk award winner and is featured on the cast album for the production."
Reference Answer: "Laura Ann Osnes"
Correct: True

Question 17: "{QUESTION}"


Provided Answer: "{GEN ANSWER}"
Reference Answer: "{GOLD ANSWER}"
Correct:

Table 12: KILT HotpotQA judge prompt.
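The HotpotQA template follows the same True/False completion format as Table 11. When several judge models are used as a panel rather than a single judge, each judge's verdict on the same filled prompt must be pooled into one label. The sketch below shows one way such pooling could be implemented, using majority voting over binary verdicts as an illustrative rule; the function and variable names are assumptions for the example, not our released code.

from collections import Counter
from typing import Callable, Dict

def panel_verdict(judges: Dict[str, Callable[[str], str]], prompt: str) -> bool:
    """Pool binary verdicts from several judge models via majority vote (illustrative rule)."""
    votes = [generate(prompt).strip().lower().startswith("true")
             for generate in judges.values()]
    counts = Counter(votes)
    # Ties are broken in favour of "correct" here; this choice is arbitrary.
    return counts[True] >= counts[False]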


You will be given a Question and a Provided Answer. Judge whether the Provided Answer is correct by comparing it to the Reference Answer. Differently formatted dates, people with
missing middle names, and alternative spellings should all be considered the same. If the Provided Answer is correct say exactly "True", otherwise say "False".

Question 1: "If you landed at ’Santa Catarina Airport on which Island would you be in?"
Provided Answer: "If you landed at Santa Catarina Airport, you would be on the island of Santa Catarina, in the Brazilian state of the same name. The island is approximately 54km long
and 18km wide and is home to the state capital, Florianópolis. It is the largest island in an archipelago of over 30 islands."
Reference Answer: "Geography of the Madeira Islands"
Correct: False

Question 2: "When it was a first performed at the Abbey Theatre in Dublin in 1907, which play caused rioting in the streets because of objections to the portrayal of rural Irish
Catholic immorality?"
Provided Answer: "Based on the provided search results, the play that caused rioting in the streets when it was first performed at the Abbey Theatre in Dublin in 1907 due to
objections over the portrayal of rural Irish Catholic immorality was "The Playboy of the Western World" by J.M. Synge.

Several documents mention the riots that erupted during and after the opening performance of Synge’s "The Playboy of the Western World" at the Abbey Theatre in January 1907. The riots
were instigated by Irish nationalists and republicans who viewed the play’s content as offensive to public morals and an insult to Ireland (Documents 1, 3, 7).

Document 5 specifically states that "the first performance of J.M. Synge’s play ‘The Playboy of the Western World’ at the Abbey Theatre in Dublin triggers a week of rioting" in 1907.
Document 9 also notes that a performance of the play on January 28, 1907 "was interrupted by the audience who continue to boo, hiss and shout."

The objections seem to have stemmed from the play’s depiction of patricide and perceived immorality in rural Irish Catholic society, which nationalists found offensive (Document 7). So
"The Playboy of the Western World" by J.M. Synge was the controversial play that provoked riots when first staged at the Abbey Theatre in 1907 due to its portrayal of the Irish
countryside."
Reference Answer: "’THE PLAYBOY OF THE WESTERN WORLD’"
Correct: True

Question 3: "The religious order of Poor Ladies of San Damiano has what more common name?" Provided Answer: "The religious order of Poor Ladies of San Damiano is more commonly known
as the Poor Clares. This order was founded by Saints Clare of Assisi and Francis of Assisi in 1212, and it was the second Franciscan Order to be established. The Poor Clares were
initially referred to as the Order of Poor Ladies, and later the Clarisses, the Minoresses, the Franciscan Clarist Order, and the Second Order of Saint Francis. However, the name Poor
Clares is the most recognized and widely used today."
Reference Answer: "Franciscan Clarist Order"
Correct: False

Question 4: "Which country defeated the Ivory Coast to win football’s Africa Cup of Nations?"
Provided Answer: "Based on the provided search results, Ghana defeated the Ivory Coast in the final of the 1992 Africa Cup of Nations. The key evidence is from Document 1 and Document
10:

Document 1 states: "Ivory Coast set a record for the competition by holding each of their opponents scoreless in the six matches of the final tournament after his contributions helped
Ghana reach the final; he was, however, suspended for that match and Ghana lost to Ivory Coast in a penalty shootout that saw each side make 11 attempts to determine the winner."

Document 10 directly mentions: "The 1992 Africa Cup of Nations Final was a football match that took place on January 26, 1992, at the Stade de l’Amitié in Dakar, Senegal, to determine
the winner of the 1992 African Cup of Nations. Ivory Coast defeated Ghana 11–10 on penalties after a goalless draw to win their first African Cup."

So the Ivory Coast defeated Ghana in the 1992 Africa Cup of Nations final after a penalty shootout. None of the other provided documents contradict this or indicate another country
defeated the Ivory Coast to win that specific tournament."
Reference Answer: "health care in zambia"
Correct: False

Question 5: "Fred Scuttle was one of the best-known characters of which British comedian?"
Provided Answer: "The provided documents do not directly mention Fred Scuttle or the British comedian associated with this character. Therefore, based on the information available in
the documents, it is not possible to accurately answer the question about which British comedian is best known for the character Fred Scuttle."
Reference Answer: "Alfred Hawthorne Hill"
Correct: False

Question 6: "What element is the main constituent of anthracite?"


Provided Answer: "The main constituent of anthracite is carbon. High-grade anthracite, for example, has a chemical formula represented by CHONS, indicating it contains carbon,
hydrogen, oxygen, nitrogen, and sulfur, with carbon making up 94% of its composition. Ultra high-grade (UHG) anthracite typically has a minimum carbon content of 95%, further
emphasizing carbon as the primary element in anthracite."
Reference Answer: "Carbon"
Correct: True

Question 7: "Which American played in his 10th Ryder Cup in 2014?"


Provided Answer: "Mickelson extended his own U.S. team record with his 10th appearance at the 2014 Ryder Cup."
Reference Answer: "philip alfred mickelson"
Correct: True

Question 8: "Who is the lead singer of the Sheffield band the Arctic Monkeys?"
Provided Answer: "The lead singer of the Sheffield band Arctic Monkeys is Alex Turner. Turner also plays guitar and is the frontman of the four-piece indie rock band.
Arctic Monkeys have released six studio albums, including a debut in 2002 titled "Whatever People Say I Am, That’s What I’m Not"."
Reference Answer: "Alex Turner"
Correct: True

Question 9: "{QUESTION}"
Provided Answer: "{GEN ANSWER}"
Reference Answer: "{GOLD ANSWER}"
Correct:

Table 13: KILT TriviaQA judge prompt.


SYSTEM: You are judging whether a model has generated a correct answer to a question. Study the examples the user gives you as they will be very informative for how to do the task. The Reference
Answers you get will be short. An model’s answer will be longer, and can be considered correct if it contains the semantic content of short reference answer somewhere within it. Don’t worry about
factuality with respect to the real world, just judge the example based on what you see. No need to overthink this task, it really comes down to just soft matching. Answer with only the word ’Yes’ or
’No’

USER: Question: when did the nfl start playing in london Model Answer: According to the provided search results, the NFL started playing regular season games in London as part of the NFL
International Series in 2007. Specifically:

Document [5] states: "The NFL International Series was inaugurated in 2007 to host NFL regular season games outside the United States. Played at the new Wembley Stadium in London (rebuilt and
reopened in 2007), the series increased from one to two games for the 2013 season, to three games for the 2014 season, and then to four games from the 2017 season."

Document [9] also mentions: "Since 2007, the league has held multiple regular season games in London each season as part of NFL London Games, allowing the league to test solutions to some of the
challenges facing a hypothetical London franchise."

So 2007 marked the start of the NFL playing regular season games at the newly rebuilt Wembley Stadium in London as part of the International Series. Prior to that, there were some preseason games
played in London, but 2007 was the first year for regular season NFL games in the city.
Reference Answer: since 2007
Correct: Yes
-----
Question: where is wind power used most in the united states
Model Answer: Iowa is the state that generates the highest percentage of energy from wind power, with more than 35% of its electric power coming from wind. Alternatively, North Dakota has the highest
per capita wind generation.

The state with the most installed wind power capacity is Texas, followed by Iowa. Texas’s capacity is around 20,000 MW, almost double that of Iowa’s 3,670 MW. The Roscoe Wind Farm in Texas is also
the world’s largest wind farm at 780 MW.

California has doubled its wind power capacity since 2002 and now gets about 5% of its energy from wind power. This is still lower than the twelve U.S. states that get more than 20% of their energy
from wind.
Reference Answer: Texas
Correct: Yes
-----
Question: who is the owner of phoenix mall in chennai
Model Answer: The owner of Phoenix Marketcity in Chennai is Phoenix Mills Limited, as mentioned in the document titled "Phoenix Marketcity (Chennai)." This mall was jointly developed by Phoenix
Mills Limited and Crest Ventures Ltd.
Reference Answer: Phoenix Mills Limited
Correct: Yes
-----
Question: number of degree of freedom for plane mechanism
Model Answer: In mechanics, the degree of freedom (DOF) of a mechanical system is the number of independent parameters that define its configuration. The number of degrees of freedom for a mechanism
on a two-dimensional plane is three, which consists of two translations and one rotation.
For a system of multiple bodies, the combined DOF is the sum of the DOFs of the individual bodies, minus the internal constraints on their relative motion. Therefore, a mechanism formed of multiple
rigid bodies may have more than three DOFs.
Reference Answer: two
Correct: No
-----
Question: who had the best strike rate among batsmen who scored over 200 runs in the world cup
Model Answer: The provided documents do not specify the strike rate of all batsmen who scored over 200 runs in the World Cup. However, among the batsmen mentioned who have scored over 200 runs in the
World Cup, Jehan Mubarak has the highest ever strike rate in a T20 World Cup match for Sri Lanka, with a strike rate of 353.84 (Document [6]). But it’s important to note that this strike rate is not
necessarily reflective of his overall performance in the World Cup, as it could have been achieved in a single match.

For a more accurate comparison of strike rates among batsmen who scored over 200 runs in the World Cup, a comprehensive list of all such batsmen and their respective strike rates would be required.
Reference Answer: Rohit Sharma
Correct: No
-----
Question: who played ben stone son on law and order
Model Answer: According to the provided search results, Ben Stone’s son Peter Stone was played by Philip Winchester. Some key details:

- Peter Stone is the son of Benjamin "Ben" Stone, the former Executive Assistant District Attorney portrayed by Michael Moriarty on Law & Order.

- Peter first appeared as a character on the Chicago P.D. episode "Justice", which served as a backdoor pilot for the short-lived spin-off Chicago Justice, where he was the lead character played by
Philip Winchester.

- After Chicago Justice was canceled, Philip Winchester reprised his role as Peter Stone in Season 19 of Law & Order: Special Victims Unit, where he became the Assistant District Attorney for the Sex
Crimes Bureau.

- In several SVU episodes, Peter describes having a strained relationship with his workaholic father Ben Stone, who neglected his family in favor of his prosecutorial duties.

So in summary, Philip Winchester played the role of Peter Stone, the son of the iconic Law & Order character Ben Stone, across the shows Chicago Justice and Law & Order: SVU.
Reference Answer: Michael Moriarty
Correct: No
-----
Question: who won so you think you can dance season 13
Model Answer: I’m sorry, I couldn’t find any information about the winner of season 13 of So You Think You Can Dance.
Reference Answer: Leon ‘‘ Kida ’’ Burns
Correct: No
-----
Question: when did sweet caroline start at red sox games
Model Answer: The song "Sweet Caroline" started being played at Fenway Park, home of the Boston Red Sox, since at least 1997. However, it has been played consistently in the middle of the eighth
inning at every game since 2002.
Reference Answer: at least 1997
Correct: Yes
-----
Question: when did end of the road come out
Model Answer: I found multiple results for "End of the Road":

## End of the Road (1944 film)


End of the Road is a 1944 American crime film directed by George Blair and written by Denison Clift, Gertrude Walker and Albert Beich. The film stars Edward Norris, John Abbott, June Storey, Jonathan
Hale, Pierre Watkin and Ted Hecht.

## End of the Road (1970 film)


End of the Road is a 1970 American comedy-drama film directed, co-written, and edited by Aram Avakian and adapted from a 1958 novel by John Barth. The film stars Stacy Keach, James Earl Jones and
Harris Yulin.

## End of the Road (Boyz II Men song)


End of the Road is a single recorded by American R&B group Boyz II Men for the "Boomerang" soundtrack. It was released on June 30, 1992.

## End of the Road (Jerry Lee Lewis song)


End of the Road is a song by Jerry Lee Lewis. It appeared on the 1976 Sun Records compilation album "The Original Jerry Lee Lewis" which was released by the Sun Record Company.

Which of these were you referring to?


Reference Answer: June 30 , 1992
Correct: Yes
-----
Question: who played john connor in the original terminator
Model Answer: John Connor is not physically portrayed by any actor in the original "Terminator" film released in 1984, as he is only mentioned and not seen. The character of John Connor is first
referred to in this film as the future leader of the human resistance against the machines, but he does not make an appearance.
Reference Answer: Edward Furlong throughout the remainder of the film
Correct: No
-----
Question: {QUESTION}
Model Answer: {GEN ANSWER}
Reference Answer: {GOLD ANSWER}
Correct:

Table 14: KILT NQ GPT-4-optimized judge prompt.
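Unlike the completion-style templates above, this prompt is split into a SYSTEM instruction and a USER message containing the few-shot examples and the test instance, and the judge is asked to answer "Yes" or "No" rather than "True" or "False". The following sketch shows how such a chat-format prompt could be assembled and parsed; the `chat` callable is a hypothetical stand-in for whichever chat-completion client is used.

from typing import Callable, Dict, List

def judge_with_chat_prompt(chat: Callable[[List[Dict[str, str]]], str],
                           system_text: str, user_text: str) -> bool:
    """Send the SYSTEM/USER judge prompt to a chat model and parse its Yes/No verdict."""
    messages = [
        {"role": "system", "content": system_text},  # the SYSTEM instruction shown above
        {"role": "user", "content": user_text},      # few-shot examples plus the filled query
    ]
    completion = chat(messages)  # `chat` is a hypothetical wrapper around a chat-completion API
    return completion.strip().lower().startswith("yes")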
