
ChatBench: From Static Benchmarks to Human-AI Evaluation

Serina Chang
Microsoft Research
University of California, Berkeley
[email protected]

Ashton Anderson
Microsoft Research
University of Toronto
[email protected]

Jake M. Hofman
Microsoft Research
[email protected]

arXiv:2504.07114v1 [cs.CL] 22 Mar 2025

Abstract

With the rapid adoption of LLM-based chatbots, there is a pressing need to evaluate what humans and LLMs can achieve together. However, standard benchmarks, such as MMLU, measure LLM capabilities in isolation (i.e., "AI-alone"). Here, we design and conduct a user study to convert MMLU questions into user-AI conversations, by seeding the user with the question and having them carry out a conversation with the LLM to answer their question. We release ChatBench, a new dataset with AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144K answers and 7,336 user-AI conversations. We find that AI-alone accuracy fails to predict user-AI accuracy, with significant differences across multiple subjects (math, physics, and moral reasoning), and we analyze the user-AI conversations to provide insight into how they diverge from AI-alone benchmarks. Finally, we show that fine-tuning a user simulator on a subset of ChatBench improves its ability to estimate user-AI accuracies, increasing correlation on held-out questions by more than 20 points, creating possibilities for scaling interactive evaluation.1

1 Our dataset ChatBench is available at https://huggingface.co/datasets/microsoft/ChatBench.

1 Introduction

In 2024, nearly 40% of US adults reported using generative AI in their everyday lives, an unprecedented rate of adoption for a new technology (Bick et al., 2024). As these models, particularly large language models (LLMs), become more integrated into our lives, it becomes increasingly important to evaluate them based on not only their capabilities in isolation, but also their interactions with humans. However, there is a large gap between human interactions and how standard benchmarks, such as Massive Multitask Language Understanding (MMLU), evaluate models (Hendrycks et al., 2021). These benchmarks test models on a fixed set of questions, and for each question, they prompt the model with the entire question text and often constrain it to respond with a single multiple choice option as its answer. In contrast, interactions with human users are far more variable, open-ended, and subject to ambiguity. Even conditioned on the same underlying intent, users may phrase their prompts differently, leave out information in their early prompts, or rely on context in later prompts. Robust AI models need to understand how to work with users in these contexts to provide accurate information and complement human expertise.

Recently, there have been efforts to evaluate how humans interact with LLMs, such as examining real-world conversations using a strong LLM as a judge (Lin et al., 2024; Li et al., 2024c). However, these new evaluations have been largely disconnected from standard benchmarks, which are widely used; for example, every LLM released by OpenAI, Google, and Meta, inter alia, has reported its performance on MMLU (OpenAI, 2023; Gemini Team Google, 2023; Llama Team, AI@Meta, 2024). This disconnect is due to a large distribution shift between benchmark questions and questions asked by real-world users, missing the user's true intent, and missing ground-truth labels to judge the interaction, necessitating techniques like LLM-as-judge. As a result, it is difficult to directly compare results from standard benchmarks to real-world interactions or to understand how incorporating interactions changes evaluation insights.

Here, we seek to bring these lines of research closer together by directly converting benchmarks into user-AI conversations. We focus on MMLU, as one of the most widely used benchmarks, and design a user study where we seed users with an MMLU question and have them carry out a conversation with an LLM with the intent of answering that question. For each question, we test the LLM in isolation (i.e., "AI-alone") and evaluate the accuracy of a user interacting with the LLM (i.e., "user-AI"); furthermore, we also gather "user-alone" data per question to understand how much users improve with the LLM. This parallel data has two advantages: first, we can now conduct an apples-to-apples comparison of AI-alone performance, as reported in most papers, vs. user-AI performance on the same questions, so that we can isolate the effects of incorporating interaction into evaluation. Second, recent works have explored the possibility of simulating the user in user-AI conversations (Li et al., 2024a) but lack sufficient data for training and testing. Our approach of "seeding" users with a question corresponds naturally to a new way to initialize user simulators, and the large-scale data we collect enables fine-tuning and validating a user simulator on this task, improving the trustworthiness of simulations for AI evaluation.

Our resulting dataset ChatBench, which we release publicly, consists of AI-alone, user-alone, and user-AI data for 396 questions and two LLMs (GPT-4o and Llama-3.1-8b), with 144K answers and 7,336 user-AI conversations. Our study design also includes two user-AI conditions—where the user attempts the question first on their own vs. uses AI directly—to explore nuances in user behavior. Our study reveals that AI-alone accuracy fails to predict user-AI accuracy, with significant differences across multiple subjects (math, physics, and moral reasoning). We also analyze the user-AI conversations to understand where user-AI interactions are diverging from AI-alone benchmarks. Our contributions are summarized as follows:

• We design and conduct a user study to convert MMLU questions into user-AI conversations and release a large-scale dataset ChatBench.

• We show that AI-alone accuracy fails to predict user-AI accuracy, across subjects, models, AI-alone methods, and user-AI conditions, and we analyze user-AI conversations to understand where AI-alone and user-AI diverge.

• We develop a new user simulator that mimics our user study task and show that fine-tuning our simulator on ChatBench improves its correlation with real user-AI accuracies by 22-26 points and outperforms baselines.

All together, our work helps to reconcile two vital lines of research in AI evaluation, revealing how interactions change evaluation insights and paving the way towards scalable interactive evaluation.

2 Related Work

Benchmarks. In this work, we focus on MMLU as one of the most commonly used LLM benchmarks (Hendrycks et al., 2021). MMLU is a question-answering (QA) dataset, consisting of multiple choice questions across 57 subjects (which we discuss in detail in Section 3.2). We also draw on the efforts of MMLU-Redux (Gema et al., 2024), where authors noted some quality concerns in the original MMLU, so they sampled a large number of MMLU questions and manually annotated them for errors. While we conduct our user study on MMLU, our approach of converting QA benchmarks to a user-AI conversation is general, and could be applied to other QA benchmarks, such as HotPotQA (Yang et al., 2018) or GSM8K (Cobbe et al., 2021), as well as adapted to non-QA tasks.

Evaluating human-AI interactions. Recently, there have been growing efforts to evaluate AI models based on their interactions with humans. For example, some works gather real-world interactions (e.g., WildChat (Zhao et al., 2024), ChatbotArena (Chiang et al., 2024)) and evaluate the interactions (e.g., WildBench (Lin et al., 2024), ArenaHard (Li et al., 2024c), MT-Bench (Zheng et al., 2023), LMSYS-Chat-1M (Zheng et al., 2024)), typically using a strong LLM as a judge. However, as discussed before, it is difficult to directly compare these evaluation results to standard benchmarks, due to the lack of ground-truth user intents and interaction labels, distribution shift in questions, and change in evaluation metric. Other works have evaluated human-AI interactions in diverse contexts, such as theorem proving (Collins et al., 2024), education (Jurenka et al., 2024), co-writing with AI (Dhillon et al., 2024), and collaborating with AI agents (Shao et al., 2024), and sought to understand where human-AI combinations outperform either alone (Bansal et al., 2021; Vaccaro et al., 2024).

Our work builds on Lee et al. (2023), who argue for the need to evaluate human-LM interactions, covering five types of tasks including QA. Their work includes an exploratory user study where they have users interactively answer MMLU questions; however, they only test 30 questions and do not explore simulation. Our study builds on theirs by testing 396 questions, at a large enough scale to estimate significant effects and fine-tune a user simulator, and introduces an AI-alone method that is a far more competitive baseline for estimating user-AI results. Furthermore, our study tests more sophisticated LLMs, complex reasoning subjects, and user-AI effects across levels of question difficulty and user-AI conditions. While different in domain, our work is also related to Li et al. (2024b), who convert medical benchmarks into simulated interactions between a patient and an expert.

Simulation with LLMs. LLMs have shown promising capabilities to realistically simulate human behaviors, such as responses to surveys and social science experiments (Argyle et al., 2023; Horton, 2023; Hwang et al., 2023; Hewitt et al., 2024; Suh et al., 2025) or interactions between humans (Park et al., 2023; Chang et al., 2024). There is also much interest in developing LLM-based user simulators to scale AI evaluation and training (Dubois et al., 2023; Ren et al., 2024; Kong et al., 2024; Li et al., 2024a). However, LLMs can sometimes produce unrealistic simulations of humans, with risks of bias or uniformity (Cheng et al., 2023a,b; Bisbee et al., 2024; Wang et al., 2025). Thus, there is a need to rigorously test whether LLM simulators produce realistic outputs and match insights that we would learn from real humans. Here, we examine a setting with well-defined simulator goals (i.e., does the simulator match user behavior and accuracy in real user-AI conversations) and release a large-scale dataset that enables training and validation of simulators in this setting.

3 User Study Design

In this section, we discuss our user study design, including the task flow and interface, how we selected questions, and data collection. We provide additional details in Appendix A.

3.1 Task Flow and Interface

Figure 1 shows the flow of our user study. In Phase 1, users are asked to answer each question to the best of their ability on their own. In Phase 2, users are asked to chat with an unnamed "AI Chatbot" to help them answer their question. We test two LLMs, contrasting GPT-4o as a strong model and Llama-3.1-8b as a relatively weaker model. We require interaction in Phase 2—the user cannot move onto the next question without sending a message and we say that low-effort conversations, e.g., only "hi", will be flagged—but otherwise, we do not specify at all how the user should interact with the AI Chatbot. In both phases, users are asked to first report how confident they are about approaching the problem, before attempting to answer it. This additional question-level variable allows us to analyze how AI assistance helps users across varying levels of confidence. After Phase 2, all users provide feedback on the task, with free-text responses including whether they found the AI Chatbot helpful and if they saw it make any mistakes. In Figure 2, we show a screenshot of what users see in Phase 2; in the Appendix, we provide screenshots of all other pages in our task (Figures A2-A9).

Figure 1: Flow of our user study. In Phase 1, each question involves reporting confidence and then giving a user-alone answer. In Phase 2, each question involves reporting confidence and then either answering directly with the AI (direct-to-AI) or answering alone first and then with the AI (answer-first). The study ends with a feedback form.

Figure 2: Screenshot from Phase 2 where the user interacts with an AI Chatbot to answer the question.

Conditions. We explore two user-AI conditions: answer-first and direct-to-AI. In the answer-first condition, the user attempts to answer each Phase 2 question on their own first before answering with AI, but in the direct-to-AI condition, they have immediate access to AI for the Phase 2 questions (in both conditions, Phase 1 is all user-alone). The advantage of answer-first is that, for the same question, we can record a user's answer on their own vs. with AI, allowing us to estimate the marginal impact of AI more precisely (i.e., within-subjects), while for direct-to-AI, the set of user-alone answers and user-AI answers for a given question come from different users (i.e., between-subjects). However, we hypothesized that user behavior and accuracy in the user-AI stage could be impacted by the user attempting the answer first, reducing ecological validity if we believe users typically go directly to AI in the real world. Thus, we keep both conditions, allowing us to test our hypothesis and explore nuances in user behavior.
Incentivization. To incentivize participants in our study to answer questions correctly, we included a small bonus of $0.10 per correct answer, on top of a base compensation of $5.00 for completing the entire task. We included these incentives to improve ecological validity, since our study is meant to capture how a real-world user would interact with an AI system if they have a question that they genuinely want to answer. In Appendix A.1, we discuss pilots we ran with and without incentivization, as well as how we mitigated risks of cheating with external tools.

3.2 Question Selection

We consider five datasets from MMLU for our experiments: Elementary, High School, and College Mathematics, Conceptual Physics, and Moral Scenarios. We include three math datasets since this subject still poses unique challenges for LLMs: for example, the HELM leaderboard (Liang et al., 2023) reports that while GPT-4o's mean accuracy on MMLU is 84%, its accuracy is only 48% on High School Math and 51% on College Math.2

2 https://crfm.stanford.edu/helm/mmlu/latest/#/leaderboard.

Furthermore, the three math datasets stratify different levels of difficulty for humans, allowing us to explore how user-AI effects change across difficulty levels. We also include Conceptual Physics and Moral Scenarios as two other reasoning domains with very different types of problems and differing levels of human expertise. In Figure 3, we provide examples of questions from these datasets, showcasing their diversity.

Figure 3: Examples of questions from our user study.

Elementary Mathematics: A number is rounded to the nearest hundred. The result is 500. Which number could not be the number before it was rounded to the nearest hundred? A. 458 B. 463 C. 547 D. 559

Conceptual Physics: To say that a rose petal is red is to say that it A. absorbs red B. reflects red C. emits red D. transmits red

Moral Scenarios: For which of these two scenarios does the main character do something clearly morally wrong? Scenario 1 | I embraced my sister's newly widowed husband to show him sympathy. Scenario 2 | I came home late one night to find that I had lost my house keys, so I had to break into my neighbor's house. A. Wrong, Wrong B. Wrong, Not Wrong C. Not Wrong, Wrong D. Not Wrong, Not Wrong

To aid with question selection, we use the annotations from MMLU-Redux (Gema et al., 2024). The authors recognized occasional quality issues with the original MMLU, so for each MMLU dataset, they sampled 100 questions from the test set uniformly at random and labeled them for errors. While they found many errors in some datasets (e.g., Virology), the majority of the questions (92%-99%) in the datasets we chose passed their review. As a second layer of quality control, we also ran OpenAI's advanced reasoning o1 model over the 100 questions and manually checked the questions that o1 did not get correct. We kept the intersection of questions that passed MMLU-Redux's inspection and ours (with o1's help).

Batches. To reduce variance in the number of answers that each question received, we organized the questions into batches and selected a random batch per user, instead of selecting each question randomly. For the math questions, each batch consisted of 5 elementary, 5 high school, and 2 college questions. We included fewer college questions since we found in pilots that college questions were too difficult for most users, so they tended to defer to the LLM's first answer without much interaction. Based on the number of questions that passed inspection, we were able to create 19 math batches, with 95 elementary, 95 high school, and 38 college questions in total. For Conceptual Physics and Moral Scenarios, we constructed 7 batches of size 12, resulting in 84 questions for each subject.
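For concreteness, the math batching scheme above can be sketched as follows. This is a minimal illustration, assuming question IDs are already grouped by dataset; the function and variable names are ours, not taken from the study's released code.

    import random

    def make_math_batches(elem_ids, hs_ids, college_ids, n_batches=19, seed=0):
        """Assemble math batches of 5 elementary, 5 high school, and 2 college questions."""
        rng = random.Random(seed)
        for ids in (elem_ids, hs_ids, college_ids):
            rng.shuffle(ids)
        batches = []
        for i in range(n_batches):
            batch = (elem_ids[5 * i: 5 * i + 5]
                     + hs_ids[5 * i: 5 * i + 5]
                     + college_ids[2 * i: 2 * i + 2])
            batches.append(batch)
        # 19 batches use 95 elementary, 95 high school, and 38 college questions in total
        return batches

Assigning each participant one whole batch uniformly at random, rather than sampling questions independently, keeps the number of answers collected per question roughly balanced.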

3.3 Data Collection

We recruited workers on Prolific to participate as users in our study (see eligibility criteria in Appendix A). For our full pre-registered study, we recruited 650 workers, and we also ran two medium-sized pilots (100 workers without incentives and 60 workers with incentives). When a user began the study, they were randomly assigned to one of the three subjects (60% probability for math, 20% for conceptual physics, and 20% for moral scenarios) and assigned uniformly at random to one of that subject's question batches, one of the two user-AI conditions, and one of the two models (GPT-4o and Llama-3.1-8b). Within the question batch, 3 questions were randomly assigned to Phase 1 and 9 to Phase 2. We also included an attention-check question for every user, which we found the vast majority (over 99%) of users passed.
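As a rough sketch, the assignment logic described above amounts to the following (probabilities and counts are from the text; the function signature and labels are illustrative, not from the study code):

    import random

    def assign_user(batches_by_subject, rng=random):
        """Randomly assign a participant to a subject, batch, condition, model, and phase split."""
        subject = rng.choices(
            ["math", "conceptual_physics", "moral_scenarios"],
            weights=[0.6, 0.2, 0.2],
        )[0]
        batch = rng.choice(batches_by_subject[subject])           # uniform over the subject's batches
        condition = rng.choice(["answer_first", "direct_to_ai"])  # user-AI condition
        model = rng.choice(["gpt-4o", "llama-3.1-8b"])            # AI Chatbot backend
        questions = rng.sample(batch, len(batch))                 # shuffle the 12-question batch
        phase1, phase2 = questions[:3], questions[3:]             # 3 questions in Phase 1, 9 in Phase 2
        return subject, condition, model, phase1, phase2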
Compiling data over the three runs, we have 10,828 confidence answers, 7,148 user-alone answers, and 7,336 user-AI answers and conversations in ChatBench (see Table A3 for additional data statistics). While we include data from all three runs in ChatBench to provide a larger resource for the community, for our analyses in the rest of the paper, we only use data from the workers in our full pre-registered study so that populations within our analysis are entirely comparable.

4 Experimental Results

In this section, we describe our experimental results, including how we conducted AI-alone experiments, comparisons of AI-alone vs. user-AI results, and analyses of the user-AI conversations. For our main results comparing AI-alone vs. user-AI, we preregistered our analyses on AsPredicted.3 We provide additional results and methodological details (e.g., statistical tests) in Appendix B.

3 https://aspredicted.org/n84n-sn3f.pdf.

4.1 AI-Alone Experiments

Our goal in this work is to understand how evaluation conclusions change when we move from AI-alone to user-AI settings. However, even for a fixed benchmark, there can be multiple ways to evaluate an LLM on its own. First, we try letter-only methods, which require the model to answer with only a single letter corresponding to the selected answer option ("A" through "D"). This is the method used by Lee et al. (2023), along with various leaderboards, such as HELM (Liang et al., 2023), to standardize the answer format. We try two letter-only variants, zero-shot and few-shot, where we prepend the 5 examples from the MMLU "dev" set to the prompt as in-context examples.

We also introduce a more realistic AI-alone technique which serves as a better proxy for user experience by not constraining the model's response format. The method, which we call free-text, is very simple: (1) prompt the evaluated model with the concatenated question text and answer options, without any additional instructions, (2) use GPT-4o to extract an answer (if any) from the response. We include the full prompts for all three AI-alone methods in Listings 1-4.

We ran these three AI-alone methods on the two models and all 396 questions from our user study, gathering 50 answers per model and question. As shown in Figure 4, our few-shot letter-only results for GPT-4o approximately match those reported on the HELM leaderboard per dataset (which is also few-shot letter-only, but uses the entire MMLU test sets). While prior work, like HELM, often uses temperatures of 0 for multiple choice QA, we used a temperature of 0.7, since we wanted to perfectly match the model parameters used in the user study, and 0.7 is a more realistic temperature for real-world AI chatbots.
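As an illustration, the free-text method can be approximated with the sketch below. The paper's exact prompts are given in its Listings 1-4; the extraction prompt wording here is our own assumption, and we assume both evaluated models are reachable through an OpenAI-compatible chat endpoint.

    from openai import OpenAI

    client = OpenAI()

    def free_text_answers(model, question, options, n_trials=50):
        """Free-text AI-alone evaluation: unconstrained prompt, then GPT-4o extracts the answer."""
        prompt = question + "\n" + "\n".join(f"{letter}. {text}" for letter, text in options.items())
        answers = []
        for _ in range(n_trials):
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],  # no added instructions
                temperature=0.7,  # matches the model parameters used in the user study
            )
            reply = response.choices[0].message.content
            # Hypothetical extraction prompt; the paper's exact wording is in its listings.
            extraction = client.chat.completions.create(
                model="gpt-4o",
                messages=[{
                    "role": "user",
                    "content": "Which multiple choice option (A, B, C, or D), if any, does this "
                               "response select? Reply with the letter only, or NONE.\n\n" + reply,
                }],
                temperature=0,
            )
            answers.append(extraction.choices[0].message.content.strip())
        return answers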

Figure 4: Mean accuracy per model and dataset, comparing user-alone (red), user-AI (purple), AI-alone free-text (dark blue), and AI-alone letter-only few-shot (light blue). See Tables A1-A2 for numbers and statistical tests.

4.2 AI-Alone vs. User-AI

Dataset-level accuracy. We visualize our main results in Figure 4, which shows mean accuracy per model and dataset, over user-alone (red), user-AI (purple), and AI-alone (blue). First, we see that few-shot letter-only (light blue) is a very poor predictor of user-AI performance, with a mean absolute deviation of 21 percentage points, averaged over the 10 dataset and model pairs. With a few exceptions—specifically Conceptual Physics for Llama-3.1-8b and College Mathematics and Moral Scenarios for GPT-4o—all differences are statistically significant. Results are similar for zero-shot letter-only, which we report in Tables A1-A2. Notably, our AI-alone method, free-text (dark blue), is a much better predictor of user-AI accuracy, reducing the mean absolute deviation to 10 percentage points. However, it still differs significantly from user-AI performance, notably for Moral Scenarios with Llama-3.1-8b and for all datasets except Moral Scenarios with GPT-4o.

Our results also reveal the complexity of combining humans and AI, as the size of gaps and ordering between user-alone, user-AI, and AI-alone vary over models and datasets. For example, for the math datasets, GPT-4o performs quite well on its own (using free-text), while humans struggle on their own, especially for high school and college. In these cases, user-AI accuracy is between the two, significantly better than user-alone and significantly worse than AI-alone. Meanwhile, Llama-3.1-8b performs significantly worse than GPT-4o on the math datasets, but we do not see a further drop in performance from AI-alone to user-AI. In the following section, we uncover countervailing factors that explain these results: on one hand, users introduce ambiguity compared to AI-alone methods, which include the entire question text and answer options; on the other hand, users can sometimes recognize mistakes in AI reasoning, of which there are more for Llama-3.1-8b. Finally, our results reveal that even when AI-alone benchmarks report a large gap in performance between two models, this gap can become much smaller after incorporating user interactions. Comparing GPT-4o and Llama-3.1-8b, their average gap in AI-alone free-text accuracy is 25 percentage points, but this gap shrinks to less than 10 percentage points in user-AI interactions (9 percentage points for direct-to-AI and 5 percentage points for answer-first).

Question-level accuracy. Besides mean accuracy, we can also measure the correlation in per-question accuracies. We find that the Pearson correlation between AI-alone free-text and user-AI is only r = 0.45 for direct-to-AI and r = 0.46 for answer-first. While correlations may be lower because per-question user-AI accuracies are imperfectly measured, the free-text correlation is still well below what we would expect if user-AI accuracies were drawn from the same distribution as free-text, which would range from r = 0.88 to 0.94 (Section B.2).

We also examine the correlation with per-question differences in user-AI and user-alone accuracy, since it may be more reasonable to expect AI-alone to predict the improvement the user makes with AI assistance, instead of the overall accuracy. However, the correlations remain low, at r = 0.26 for direct-to-AI and r = 0.27 for answer-first, showing that AI-alone cannot predict improvements well either. Finally, we fit a linear model to try predicting a question's user-AI accuracy from its user-alone and AI-alone accuracies. The fitted model yields a correlation of 0.55 for predicting answer-first accuracies and 0.63 for predicting direct-to-AI accuracies, demonstrating that user-AI accuracy also cannot be reliably predicted from user-alone and AI-alone accuracy.
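These question-level analyses reduce to a few standard computations. A minimal sketch, assuming per-question accuracies for one condition are stored in aligned arrays (the array and function names are ours):

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import LinearRegression

    def question_level_analysis(user_alone, ai_free_text, user_ai):
        """Per-question accuracy comparisons described in Section 4.2 (arrays aligned by question)."""
        user_alone, ai_free_text, user_ai = map(np.asarray, (user_alone, ai_free_text, user_ai))
        r_ai, _ = pearsonr(ai_free_text, user_ai)                 # AI-alone vs. user-AI accuracy
        r_gain, _ = pearsonr(ai_free_text, user_ai - user_alone)  # AI-alone vs. user improvement
        # Linear model: predict user-AI accuracy from user-alone and AI-alone accuracies
        X = np.column_stack([user_alone, ai_free_text])
        fit = LinearRegression().fit(X, user_ai)
        r_fit, _ = pearsonr(fit.predict(X), user_ai)
        return {"r_ai_alone": r_ai, "r_improvement": r_gain, "r_linear_model": r_fit}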
4.3 Characterizing User-AI Conversations

Our summary results show that user-AI accuracies are significantly different from AI-alone accuracies. To better understand what drives these differences, we use a separate LLM as an annotator to characterize the user-AI conversations. For each user-AI conversation, we gather the full log of the conversation and its associated metadata (e.g., the question ID, the correct answer, the user's selected answer, etc.), and prompt a separate instance of GPT-4o to use this information to extract the answers to several classification questions: whether the first substantive user prompt was a question, whether the first user question was a near-exact rephrasing of the original question or one of several other possibilities, and whether the first and last AI answers were correct (Listing 5).

Figure 5: Fraction of user-AI interactions that mirror AI benchmark, by subject and model.

How often does the conversation follow what we might expect if AI benchmarks were faithful proxies of human-AI interaction? We say a conversation mirrors an AI benchmark if (1) the user's first substantive prompt is a near-exact rephrasing of the benchmark question (otherwise the user is injecting their own knowledge or information into the interaction), (2) the LLM responds with an answer, and (3) the user submits that answer. In Figure 5, we see that only 34% of all interactions mirror AI benchmarks, revealing the extent to which user-AI interactions diverge from AI benchmarks.
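Given the per-conversation annotations, the "mirrors an AI benchmark" statistic is a simple aggregation. The field names below are illustrative; the actual annotation schema is defined by the GPT-4o annotator prompt (Listing 5).

    def mirrors_benchmark(annotation):
        """A conversation mirrors the benchmark if the user restates the question near-verbatim,
        the model responds with an answer, and the user submits that answer."""
        return (annotation["first_prompt_is_near_exact_rephrasing"]
                and annotation["model_gave_answer"]
                and annotation["user_submitted_model_answer"])

    def mirror_fraction(annotations):
        return sum(mirrors_benchmark(a) for a in annotations) / len(annotations)  # ~0.34 overall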
Using data from the answer-first condition also reveals that AI helps humans more often than it hinders them. When the same user answers a question first without AI and then with AI assistance, more than half (54%) of incorrect user-alone answers are corrected with AI support, while only 10% of correct user-alone answers turn incorrect with AI assistance. This data also allows us to look more closely at how often user-AI interactions improve on AI-alone performance. Compared to AI-alone free-text, user-AI accuracy is lower for 57% and higher for 15% of questions for GPT-4o, and lower for 39% and higher for 44% of questions for Llama-3.1-8b (keeping questions where we have at least 5 user-AI answers for the AI system). Thus, we see effects in both directions, and there are certainly cases where user-AI improves on AI-alone, especially for the weaker model. Below, we analyze both types of cases in more detail.

Cases where interaction introduces errors. First, we study cases where user interaction introduces errors, by focusing on questions where AI-alone free-text is always correct (accuracy of 100% over 50 trials) but the user-AI interaction results in the wrong answer. This could happen either if the user decided to ignore the AI's correct answer (e.g., if they believed they knew the answer or due to lack of effort) or if the change from AI-alone prompting to user prompting resulted in the AI no longer providing the correct answer. We find much more evidence for the latter. Among over 300 of these interactions with GPT-4o, the model only provided the correct answer in 36% of interactions, and in the remaining interactions, the model either did not provide any clear answer (43%) or provided a wrong answer (21%). Among 116 interactions with Llama-3.1-8b (there are fewer, since there are fewer questions where Llama-3.1-8b achieves perfect accuracy on its own), the model only provided the correct answer in 23% of interactions, instead providing the wrong answer in 43% and no clear answer in 34% of interactions.

We also find that in the majority of these interactions (69% for GPT-4o and 58% for Llama-3.1-8b), the user's first substantive prompt is not a near-exact rephrasing of the benchmark question, providing further evidence for our hypothesis that the change in accuracy is largely due to the shift in prompting from AI-alone benchmarks to human users. We find that a primary source of divergence is the user asking a related but different question, which is often ambiguous (e.g., leaving out critical information for a math problem).

Cases where interaction corrects AI errors. Next, we study the opposite scenario: questions where AI-alone free-text's accuracy is poor (below 10% over 50 trials) but the user-AI interaction arrives at the correct answer. We find that in nearly half of these interactions, the AI model actually provided the correct answer to the user (49% for GPT-4o and 43% for Llama-3.1-8b), suggesting that the user's prompting enabled the AI to arrive at the correct answer, even though it could not on its own. Notably, we find that in around 10% of these interactions, over the course of the interaction the AI improves, from providing either an incorrect answer or no clear answer to providing the correct answer, demonstrating the possibility for AI to improve through user interaction and the need for multi-turn analysis beyond static benchmarks.

Even when the AI is not able to arrive at the correct answer, we find that users are sometimes still able to correct the mistake and select the right answer. We visualize the rates of these corrections in Figure B1, showing that this occurs especially often with the weaker model, Llama-3.1-8b.

5 Simulating User-AI Conversations

From our user study, we showed that incorporating user interactions significantly changes evaluation conclusions, compared to AI-alone evaluation. However, data from human users is costly and time-consuming to collect, motivating the development of a user simulator to scale interactive evaluation. In this section, we describe our user simulator and present experimental results.

5.1 Fine-Tuning a User Simulator

We define a new user simulator that we can fine-tune on our collected user data, by mimicking the experience of users in our study. First, we seed the user simulator with the MMLU question, as we did with human users in our study, and we tell the simulator to interact with an AI system to answer its question (Figure 6, left). Then, we break the simulator's task into two subtasks: (1) when there is no conversation yet, we prompt the simulator to generate its first prompt as a user (Figure 6, top right), (2) given the conversation so far, we prompt the simulator to either answer the question in the form "Answer: LETTER", if the question has been answered by the conversation, or if not, generate the next user prompt (Figure 6, bottom right).

Figure 6: Example of prompts to our two-step user simulator, using one of the example questions from Figure 3. See Listings 6-8 for complete prompts.

Seed (role and question): "You are a human user interacting with an AI system, and you are trying to answer the following question: A number is rounded to the nearest hundred. The result is 500. Which number could not be the number before it was rounded to the nearest hundred? A. 458 B. 463 C. 547 D. 559"

Simulator Task 1: "Generate the first prompt you would say to the system to get started with answering your question."

Simulator Task 2: "Here is your conversation so far with the AI system: =================== YOU: [...] SYSTEM: [...] =================== If your question is answered by this conversation, return ONLY the answer in the format "Answer: <letter>". If not, generate the next prompt you would say to the system to answer your question."
rounded to the ===================
nearest hundred? If your question is answered by this 5.2 User Simulator Experiments
A. 458
conversation, return ONLY the
answer in the format "Answer: For these experiments, we use GPT-4o as our simu-
B. 463
<letter>". If not, generate the next lator. We try four baselines: the two AI-alone meth-
C. 547
prompt you would say to the system
D. 559 ods, the two-step simulator without fine-tuning,
to answer your question.
and the user simulator from IQA-EVAL (Li et al.,
Figure 6: Example of prompts to our two-step user 2024a). Their simulator, designed with prompt en-
simulator, using one of the example questions from gineering, receives a prompt consisting of a role
Figure 3. See Listings 6-8 for complete prompts. description (“You are mimicking a human.”), a task
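Concretely, converting one logged conversation into fine-tuning examples can be sketched as follows (a prompt/completion format is assumed here, and the prompt-building helpers are hypothetical stand-ins for the Task 1 and Task 2 templates):

    def conversation_to_examples(question, user_turns, ai_turns, selected_letter,
                                 task1_prompt, task2_prompt):
        """A conversation with k user utterances yields k + 1 training examples:
        one Task 1 example, k - 1 intermediate Task 2 examples, and one final Task 2
        example whose target is 'Answer: <letter>'."""
        examples = [{"prompt": task1_prompt(question), "completion": user_turns[0]}]
        for i in range(1, len(user_turns)):
            history = list(zip(user_turns[:i], ai_turns[:i]))  # conversation up to this utterance
            examples.append({"prompt": task2_prompt(question, history),
                             "completion": user_turns[i]})
        full_history = list(zip(user_turns, ai_turns))
        examples.append({"prompt": task2_prompt(question, full_history),
                         "completion": f"Answer: {selected_letter}"})
        return examples  # len(user_turns) + 1 examples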
5.2 User Simulator Experiments

For these experiments, we use GPT-4o as our simulator. We try four baselines: the two AI-alone methods, the two-step simulator without fine-tuning, and the user simulator from IQA-EVAL (Li et al., 2024a). Their simulator, designed with prompt engineering, receives a prompt consisting of a role description ("You are mimicking a human."), a task description ("You are trying to choose the correct answer for the given question."), and discussion instructions (e.g., "In each turn, please only ask one sub-question to interact with the assistant."); see Listing 9 for the full prompt. We compare these baselines to our model, the two-step simulator fine-tuned on ChatBench ("ChatBench-Sim").

In our fine-tuning experiments, we randomly split the questions from our user study into 60% for training (n = 237) and withheld 40% for testing (n = 159), and we fine-tuned on all user-AI conversations for the train questions. For all three simulator methods, we test them on the held-out test questions by generating conversations entirely from scratch, given only the question (in contrast, an easier but less realistic set-up would be to provide the real conversation up to the nth turn and have the simulator generate the next user utterance).

Evaluation metrics. We generate 10 simulator-AI conversations per test question and compare to real user-AI conversations for the same question and AI system. To evaluate whether accuracies are similar, we measure the correlation and mean absolute error (MAE) between simulator-AI vs. user-AI accuracies, keeping test questions where we have at least 5 user-AI answers (n = 132 and n = 124 for GPT-4o and Llama-3.1-8b, respectively). To evaluate whether the simulator's generated utterances are realistic, we measure the average BLEU and ROUGE scores of the simulator's first prompt compared to the real user's first prompt.
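These metrics can be computed with standard libraries; a sketch under the assumption that BLEU is computed with NLTK and ROUGE-L with the rouge_score package (the paper does not specify which ROUGE variant or implementations were used):

    import numpy as np
    from scipy.stats import pearsonr
    from nltk.translate.bleu_score import sentence_bleu
    from rouge_score import rouge_scorer

    def compare_accuracies(sim_acc, user_acc):
        """Correlation and MAE between simulator-AI and user-AI per-question accuracies."""
        sim_acc, user_acc = np.asarray(sim_acc), np.asarray(user_acc)
        corr, _ = pearsonr(sim_acc, user_acc)
        mae = np.mean(np.abs(sim_acc - user_acc))
        return corr, mae

    def first_prompt_realism(sim_first_prompts, real_first_prompts):
        """Average BLEU and ROUGE-L of simulated vs. real first user prompts."""
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        bleu = np.mean([sentence_bleu([real.split()], sim.split())
                        for sim, real in zip(sim_first_prompts, real_first_prompts)])
        rouge = np.mean([scorer.score(real, sim)["rougeL"].fmeasure
                         for sim, real in zip(sim_first_prompts, real_first_prompts)])
        return bleu, rouge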
Results. As shown in Table 1, fine-tuning our simulator yields large gains, with a 22-26 point increase in correlation and a 21-26% decrease in MAE. As shown in Figures B3-B4, a primary failure mode of the simulator before fine-tuning is that it cannot replicate human mistakes and greatly overestimates user-AI performance, producing far more questions with accuracies of 1.0 than we see in the real user-AI distribution, while the fine-tuned simulator matches the real distribution more closely. We also find that fine-tuning improves the realism of the simulator's generated utterances, with 11-16 point improvements in BLEU and ROUGE. The fine-tuned simulator also outperforms both AI-alone methods and IQA-EVAL across metrics.

Table 1: Comparing to user-AI conversations: AI-alone methods, IQA-EVAL (Li et al., 2024a), and the two-step simulator before (Two-Step) and after fine-tuning on ChatBench (ChatBench-Sim). The top-performing method on each metric is marked with *.

                                     AI: GPT-4o                          AI: Llama-3.1-8b
Type      Method                 Corr.↑  MAE↓   BLEU↑   ROUGE↑      Corr.   MAE    BLEU    ROUGE
AI-alone  Letter-only few-shot   0.30    0.31   –       –           0.21    0.40   –       –
AI-alone  Free-text              0.49    0.20   –       –           0.61    0.20   –       –
Sim-AI    IQA-EVAL               0.50    0.18   0.085   0.311       0.43    0.22   0.086   0.313
Sim-AI    Two-Step               0.41    0.19   0.102   0.347       0.39    0.23   0.102   0.346
Sim-AI    ChatBench-Sim          0.63*   0.15*  0.261*  0.460*      0.65*   0.17*  0.258*  0.457*

6 Conclusion

We have shown that evaluation conclusions change significantly from AI-alone benchmarks to user-AI interactions, across question domains, AI models, AI-alone methods, and user-AI conditions. Our results motivate the need for more realistic evaluations of AI models that incorporate user interactions. However, this goal is difficult to achieve, as user data is expensive to collect. To make this goal more feasible, we both release a new large-scale dataset of user interactions, ChatBench, and demonstrate the potential of building user simulators to scale interactive evaluation.

The changes we see from AI-alone to user-AI accuracies are often large enough to affect qualitative conclusions about the models. For example, what can seem like a large disparity between models on AI-alone benchmarks (e.g., 25 percentage point gap between GPT-4o and Llama-3.1-8b on free-text) can shrink to much smaller gaps after incorporating user interactions (e.g., 5 point gap for answer-first). These changes could impact real-world decisions, such as which model to deploy (e.g., a lightweight, on-device model that performs only slightly worse than a much larger off-device model might be preferable in some circumstances).

To this end, in future work we hope to understand how AI-alone benchmarks are currently used to make decisions (Hardy et al., 2024) and how those decisions might change after taking into account human interactions. We also hope to expand our analysis to more benchmarks and non-QA tasks. Finally, we hope to develop training techniques to build even more realistic user simulators: while we see large gains from fine-tuning on ChatBench, the best correlations only reach 0.63, leaving room for future improvement and innovation.

7 Limitations

Our work has several limitations, which we tried to mitigate but should be taken into consideration when interpreting the results.

Coverage. Our user study has limited coverage of possible benchmarks and user tasks. We chose to focus on the MMLU benchmark (Hendrycks et al., 2021) and question-answering as our task, since MMLU is one of the most popular LLM benchmarks and it covers a wide range of subjects, so we could test multiple subjects in comparable ways and with minimal changes to our user study. We began with question-answering since we can naturally transform a benchmark question into a user-AI conversation, where the user is trying to answer the question. However, future work should investigate whether results are consistent on other benchmarks and/or tasks, especially more open-ended generation tasks that are common in real-world user-AI interactions (Zhao et al., 2024).

Ecological validity. Our user study is meant to capture how a user would act if they have a question in mind and they are interacting with an AI system to answer their question. However, since we wanted to match the user's underlying question with the MMLU questions, we had to tell the user what question to answer, which could lead to different behavior compared to if they were intrinsically motivated to answer a question. To mitigate this, we included a small incentive ($0.10 per correct answer), so that they would try to get the correct answer, and we filtered out users who failed the
attention check; however, it is still possible that users' behaviors would be different in the real world. Our study setting was also different from real-world question-answering: we recruited workers on Prolific to do our study, where they answered 13 questions consecutively in our interface. Still, we tried to match real-world settings, such as choosing models they might interact with in the real world (e.g., GPT-4o), using realistic model parameters (e.g., temperature of 0.7), and not guiding their prompts to the AI system at all, besides requiring at least one interaction per question.

8 Broader Impacts and Ethical Considerations

Our work is driven by broader impacts: we seek to make AI evaluation more realistic and human-centered, by investigating how evaluation conclusions change when we incorporate human interactions. With our carefully designed user study, we show that evaluation conclusions change significantly from AI-alone to user-AI settings (for the same set of questions), and these results hold over different subject areas, AI models, AI-alone methods, and user-AI conditions. We hope that our work motivates AI researchers and practitioners to think more carefully about human-AI interactions when they evaluate AI systems, instead of only using AI-alone benchmarks.

The direction of evaluating human-AI interactions also raises some ethical considerations. First, we should seek to recruit diverse human participants, since an AI system that works well for one individual or group may not work well for another (e.g., depending on ability, language, preferences, etc.). Second, user studies should be run ethically: participants should be paid fairly, they should provide informed consent about how their data will be used, and their data should be anonymized and personal information removed (e.g., if they tell the AI system their name). Third, the possibility of simulating humans in human-AI interactions is exciting and could make interactive evaluation feasible at scale, but LLM-based simulations of humans also have risks that need to be addressed, such as their possibilities for stereotyping, bias, and flattening populations (Cheng et al., 2023b,a; Wang et al., 2025). Researchers hoping to build and deploy user simulators should extensively probe for such biases, especially if user demographics are provided in simulator prompts.

Acknowledgments

The authors thank Rich Ciapala for computing support and Emma Pierson for helpful discussions.

References

Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis, 31:337-351.

Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. 2021. Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 81, pages 1-16.

Alexander Bick, Adam Blandin, and David Deming. 2024. The rapid adoption of generative AI. Federal Reserve Bank of St. Louis.

James Bisbee, Joshua D. Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M. Larson. 2024. Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32:401-416.

Serina Chang, Alicja Chaszczewicz, Emma Wang, Maya Josifovska, Emma Pierson, and Jure Leskovec. 2024. LLMs generate structurally realistic social networks but overestimate political homophily. arXiv preprint arXiv:2408.16629.

Myra Cheng, Esin Durmus, and Dan Jurafsky. 2023a. Marked personas: Using natural language prompts to measure stereotypes in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23).

Myra Cheng, Tiziano Piccardi, and Diyi Yang. 2023b. CoMPosT: Characterizing and evaluating caricature in LLM simulations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP'23).

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B. Tenenbaum, William Hart, Timothy Gowers, Wenda Li, Adrian Weller, and Mateja Jamnik. 2024. Evaluating language models for mathematics through interactions. Proceedings of the National Academy of Sciences (PNAS).

Paramveer S. Dhillon, Somayeh Molaei, Jiaqi Li, Maximilian Golub, Shaochun Zheng, and Lionel Peter Robert. 2024. Shaping human-AI collaboration: Varied scaffolding levels in co-writing with language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI'24).

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaFarm: A simulation framework for methods that learn from human feedback. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS'23).

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, and Pasquale Minervini. 2024. Are we done with MMLU? arXiv preprint arXiv:2406.04127.

Gemini Team Google. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Amelia Hardy, Anka Reuel, Kiana Jafari Meimandi, Lisa Soder, Allie Griffith, Dylan M. Asmar, Sanmi Koyejo, Michael S. Bernstein, and Mykel J. Kochenderfer. 2024. More than marketing? On the information value of AI benchmarks for practitioners. arXiv preprint arXiv:2412.05520.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In The Ninth International Conference on Learning Representations (ICLR'21).

Luke Hewitt, Ashwini Ashokkumar, Isaias Ghezae, and Robb Willer. 2024. Predicting results of social science experiments using large language models. Working paper.

John J. Horton. 2023. Large language models as simulated economic agents: What can we learn from homo silicus? arXiv preprint arXiv:2301.07543.

EunJeong Hwang, Bodhisattwa Majumder, and Niket Tandon. 2023. Aligning language models to user opinions. In Findings of the Association for Computational Linguistics: EMNLP 2023.

Irina Jurenka, Markus Kunesch, Kevin R. McKee, Daniel Gillick, Shaojian Zhu, Sara Wiltberger, Shubham Milind Phal, Katherine Hermann, Daniel Kasenberg, Avishkar Bhoopchand, et al. 2024. Towards responsible development of generative AI for education: An evaluation-driven approach. arXiv preprint arXiv:2407.12687.

Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, and Benyou Wang. 2024. PlatoLM: Teaching LLMs in multi-round dialogue via a user simulator. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).

Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, and Percy Liang. 2023. Evaluating human-language model interaction. Transactions on Machine Learning Research.

Ruosen Li, Ruochen Li, Barry Wang, and Xinya Du. 2024a. IQA-EVAL: Automatic evaluation of human-model interactive question answering. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS'24).

Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S. Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov. 2024b. MediQ: Question-asking LLMs and a benchmark for reliable interactive clinical reasoning. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS'24).

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. 2024c. From live data to high-quality benchmarks: The Arena-Hard pipeline. https://lmsys.org/blog/2024-04-19-arena-hard/.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2023. Holistic evaluation of language models. Transactions on Machine Learning Research.

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. 2024. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770.

Llama Team, AI@Meta. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Evan Miller. 2024. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640.

OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, A Details on User Study
Meredith Ringel Morris, Percy Liang, and Michael S.
Bernstein. 2023. Generative agents: Interactive simu- Recruitment. We recruited workers on Prolific
lacra of human behavior. In Proceedings of the 36th to participate in our study. All Prolific workers
Annual ACM Symposium on User Interface Software who were located in the US, fluent in English, and
and Technology (UIST’23).
had not participated in one of our pilots were el-
Ruiyang Ren, Peng Qiu, Yingqi Qu, Jing Liu, Xin igible for our study. We used Prolific’s standard
Zhao, Hua Wu, Ji-Rong Wen, and Haifeng Wang. sample, which distributed our study to available
2024. BASES: Large-scale web search user simu- participants. Based on early pilots, we estimated
lation with large language model based agents. In
Findings of the Association for Computational Lin- that the task took around 25 minutes. We paid all
guistics: EMNLP 2024. participants $5.00 upon completion of the entire
task. We experimented with offering a small bonus
Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang,
per correct answer, which we discuss below.
and Diyi Yang. 2024. Collaborative gym: A frame-
work for enabling and evaluating human-agent col- All studies were reviewed and approved by the
laboration. arXiv preprint arXiv:2412.15701. Microsoft Research Institutional Review Board
(IRB #10999) and informed consent was obtained
Joseph Suh, Erfan Jahanparast, Suhong Moon, Min-
woo Kang, and Serina Chang. 2025. Language
from all participants prior to participation.
model fine-tuning on scaled survey data for predict-
ing distributions of public opinions. arXiv preprint A.1 Pilots and Incentivization
arXiv:2502.16761. Pilot 1: no incentives. We ran one medium-sized
pilot with 100 participants where we tested all
Michelle Vaccaro, Abdullah Almaatouq, and Thomas
Malone. 2024. When combinations of humans and datasets and models. At this point, we also included
AI are useful: A systematic review and meta-analysis. GPT-4o-mini as a third model, in addition to GPT-
Nature Human Behaviour, 8:2293–2303. 4o and Llama-3.1-8b. In this pilot, we did not in-
Angelina Wang, Jamie Morgenstern, and John P. Dick-
clude incentives for correct answers. Results from
erson. 2025. Large language models that replace this pilot did not show significant differences in
human participants can harmfully misportray and accuracy between GPT-4o and GPT-4o-mini, so we
flatten identity groups. Nature Machine Intelligence. decided to drop GPT-4o-mini from our full study,
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben- so that we could gather more answers per model.
gio, William W. Cohen, Ruslan Salakhutdinov, and
Pilot 2: testing incentives. In our second pilot,
Christopher D. Manning. 2018. HotpotQA: A dataset
for diverse, explainable multi-hop question answer- we wanted to test the effect of including a small
ing. In Proceedings of the 2018 Conference on Em- incentive for getting the correct answer, hypothesiz-
pirical Methods in Natural Language Processing ing that it might improve the ecological validity of
(EMNLP’18). the study since users would try harder to answer the
Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, questions correctly. We included a small bonus of
Yejin Choi, and Yuntian Deng. 2024. WildChat: 1m $0.10 per correct answer, with a maximum bonus
chatgpt interaction logs in the wild. In Proceedings of $1.30 for 13 questions, on top of the same base
of the 12th International Conference on Learning
compensation of $5.00 for completing the task.
Representations (ICLR’24).
While this bonus could help to improve ecolog-
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle ical validity, there was a risk that the incentives
Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, result in users cheating on the study, such as by
Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez,
Ion Stoica, and Hao Zhang. 2024. LMSYS-Chat-1M:
searching for the question on Google or ChatGPT.
A large-scale real-world LLM conversation dataset. To mitigate this risk, first we repeatedly required
In Proceedings of the 12th International Conference users to acknowledge that they would not use exter-
on Learning Representations (ICLR’24). nal tools (Figures A3 and A7) and we said, “Com-
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan pensation could be affected if we detect that you
Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, are using external tool.” Second, we ran a second
Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, medium-sized pilot with incentives, with 60 partici-
Joseph E. Gonzalez, and Ion Stoica. 2023. Judging pants on the three math datasets, and we compared
LLM-as-a-judge with mt-bench and chatbot arena.
In Proceedings of the 37th International Confer-
the results between Pilots 1 and 2 to see if Pilot 2
ence on Neural Information Processing Systems had unrealistic increases in accuracy that could not
(NeurIPS’23). be explained by slightly more user effort.

12
Figure A1: Comparing results from Pilot 1 (without incentives) and Pilot 2 (with incentives).

We visualize the mean accuracies per dataset and model in Figure A1. We found that, as expected, incentives tended to improve performance a little: out of 27 combinations of math datasets (3), models (3), and answer types (i.e., user-alone, user-AI answer-first, and user-AI direct-to-AI), the pilot with incentives had a higher mean accuracy 19 times. We also found that conversations were slightly longer with incentives. However, the overall improvement in accuracy was very small, only 3 percentage points, meaning we did not see unrealistic improvements that would suggest use of external tools. We also continued to see the gaps in user-AI performance between the GPT models and Llama-3.1-8b, suggesting users were basing their answers on the AI chatbot given to them. As further evidence of the use of the AI chatbot, and not external tools, we found that in the vast majority of cases (63 out of 66 examples) where the user changed from an incorrect user-alone answer to a correct user-AI answer, the new answer matched the answer given by the AI model in the user-AI conversation. Since incentives seemed to encourage users to try slightly harder, and we did not see evidence of cheating, we decided to keep incentives for our full study; our pilot comparison shows that our results were not overly sensitive to this decision.

A.2 ChatBench

Our dataset, ChatBench (https://huggingface.co/datasets/microsoft/ChatBench), compiles data over the full study (650 participants, with incentives) and the two pilots. ChatBench consists of user-alone answers, user-AI answers and conversations, and user confidence, which we asked users to report per question before attempting the question (Figure A5). ChatBench also includes AI-alone answers from our AI-alone experiments, where we tested each model 50 times per question and AI-alone method (see Section B.1 for details).

In total, ChatBench contains 7,148 user-alone answers, 7,336 user-AI answers and conversations, 10,828 user confidence answers, and 118,717 AI-alone answers, resulting in 144,029 answers in total. ChatBench contains data from more than the total number of participants we recruited (810 = 650 + 100 + 60), since some participants started but did not complete the study. In Table A3, we provide additional data statistics, including how many answers we collected per model, dataset, condition, and answer type (user-alone or user-AI).

Protecting participant privacy. To protect the privacy of our participants, first we mapped each person's Prolific ID to a new, randomly generated string of 10 letters and digits, checking that there were no collisions between individuals. The worker_id field in ChatBench contains these new strings, instead of their original worker ID on Prolific. Second, we checked for personally identifying information in the user-AI conversations. We used the European Union GDPR's definition of personal data (https://gdpr.eu/eu-gdpr-personal-data/):

    'Personal data' means any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

To check if a participant had revealed personal information, we provided our private instance of GPT-4o with all of their user-AI conversations, along with GDPR's definition of personal data, and prompted GPT-4o to answer whether the conversations contained any personal data.
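As a concrete illustration, the anonymization and screening steps could look roughly like the sketch below. This is not the exact script we ran: the prompt wording, the "gpt-4o" deployment name, and the JSON flag are placeholders, and any chat-completions client with the same interface would work.

import json
import secrets
import string

GDPR_DEFINITION = "..."  # the GDPR definition of personal data quoted above

def anonymize_worker_id(prolific_id, mapping):
    # Map a Prolific ID to a fresh random 10-character string of letters and digits,
    # re-drawing in the (unlikely) event of a collision with an existing ID.
    alphabet = string.ascii_letters + string.digits
    if prolific_id not in mapping:
        new_id = "".join(secrets.choice(alphabet) for _ in range(10))
        while new_id in mapping.values():
            new_id = "".join(secrets.choice(alphabet) for _ in range(10))
        mapping[prolific_id] = new_id
    return mapping[prolific_id]

def flag_personal_data(client, conversations):
    # Ask GPT-4o whether any of a participant's conversations contain GDPR personal data.
    # `client` is assumed to be an (Azure) OpenAI chat-completions client.
    prompt = (
        "Here is the GDPR definition of personal data:\n" + GDPR_DEFINITION
        + "\n\nHere are a participant's conversations:\n" + "\n\n".join(conversations)
        + "\n\nDo the conversations contain any personal data? Respond with a JSON object "
        + 'with one key "contains_personal_data" whose value is true or false.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder deployment name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)["contains_personal_data"]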
As expected, given the nature of these conversations (answering benchmark questions), there were very few conversations with personal data. Over the 823 participants with user-AI conversations, GPT-4o only flagged three participants with personal data. We manually inspected these three and found that two were not actually revealing personal data; they were both rephrasing a math question, "Carlos Montado was born on Saturday, November 9, 2002", in first person, leading GPT-4o to think that they were providing their birthday. One participant appeared to share personal details, so we removed their conversations from our public release.

Filtering from ChatBench for statistical analysis. For our main statistical analyses (Section 4), we only used data from the full study, and not from the two pilots. We furthermore filtered the data following the criteria we described in our preregistration (https://aspredicted.org/n84n-sn3f.pdf), such as only keeping the workers' answers from their first assignment if they had multiple. While we could control from the Prolific interface that workers could not participate in our task multiple times, once they opened our app, they could start the study and then be taken back to the beginning of the study flow (Figure 1) if they refreshed the app. If they did so, we would not want to keep their data after they refreshed, since their behavior the second time around could be affected by what they already saw the first time. When a worker opened the app or refreshed, they received a new assignment, defined by their combination of model (one of 2), user-AI condition (one of 2), subject (one of 3), and question batch (one of 7 or 18). The probability that they would receive the same assignment twice if they refreshed was very low (less than 1%), so we could check for multiple assignments to test whether they refreshed, and we used timestamps to determine their first assignment. Ultimately, we found very few workers (3%) with multiple assignments, and for those workers, we kept only the data from their first assignment.

Overall, our filtering criteria were as follows (a code sketch of this logic appears after the list):

• First, we only kept data from workers who completed the study, which we checked by cross-referencing the list of worker IDs, given to us by Prolific, of all workers who completed the study and clicked on the completion code.

• Second, we only kept data from workers who passed the attention check. The vast majority of workers (99.5%) passed the attention check. We also did not include answers to the attention check in our user-alone estimates, since the attention check was easier than the other questions.

• Third, we checked for workers with multiple assignments, and we only kept data from those workers on their first assignment.

In our released version of ChatBench, we include both the unfiltered data from the full study and the filtered version of the data, for replicability and transparency.
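This filtering can be expressed as a short pandas sketch. The column names below (worker_id, assignment_id, assignment_start_time, is_attention_check) are hypothetical stand-ins rather than the exact field names in the released data.

import pandas as pd

def filter_for_analysis(answers, completed_ids, passed_attention_ids):
    # answers: one row per answer; completed_ids and passed_attention_ids are sets of worker IDs
    df = answers.copy()

    # 1. Keep only workers who completed the study (cross-referenced against Prolific).
    df = df[df["worker_id"].isin(completed_ids)]

    # 2. Keep only workers who passed the attention check, and drop the
    #    attention-check question itself from the accuracy estimates.
    df = df[df["worker_id"].isin(passed_attention_ids)]
    df = df[~df["is_attention_check"]]

    # 3. For workers with multiple assignments (e.g., after refreshing the app),
    #    keep only the earliest assignment by timestamp.
    first_assignment = (
        df.sort_values("assignment_start_time")
          .groupby("worker_id")["assignment_id"]
          .first()
    )
    return df[df["assignment_id"] == df["worker_id"].map(first_assignment)]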
A.3 Task Interface

We provide screenshots of all of the pages in our user study interface, including the Introduction Page (Figure A3), Phase 1 Tutorial (Figure A4), Confidence Page (Figure A5), User-Alone Page (Figure A6), Phase 2 Instructions (Figure A7), Phase 2 Tutorial (Figure A8), User-AI Page (Figure 2), and Feedback Page (Figure A9).

On the User-AI Page (Figure 2), we tried to minimize our influence on the user-AI interactions, but we also wanted to ensure that users put in meaningful effort to interact with the AI, as they would if they were intrinsically motivated to answer a question. We required the user to send at least one message before moving on to the next question, and we said in the instructions that low-effort conversations, e.g., only "hi", would be flagged. We also disabled copy-and-paste of the question text, both to prevent the use of external tools (e.g., ChatGPT) and to prevent trivial conversations where the user simply copy-and-pasted. While users may copy-and-paste in the real world if presented with a question, we were trying to capture the case where a user had intrinsic motivation to answer a question; in those cases, the question would usually be in their heads, so they would not have anything to copy-and-paste. Furthermore, our free-text AI-alone method already serves as a good estimate of user copy-and-paste, since it simply copy-and-pastes the question to the AI as the first prompt, then uses GPT-4o to extract an answer from the AI's free-text response.
Dataset Model Comparison Acc1 SE1 Acc2 SE2 z-value p-value
Elementary Math GPT-4o AI letter zero shot vs. UserAI direct to ai 0.73 0.04 0.92 0.02 -3.92 <0.01
Elementary Math GPT-4o AI letter zero shot vs. UserAI answer first 0.73 0.04 0.90 0.02 -3.43 <0.01
Elementary Math GPT-4o AI letter few shot vs. UserAI direct to ai 0.74 0.04 0.92 0.02 -3.83 <0.01
Elementary Math GPT-4o AI letter few shot vs. UserAI answer first 0.74 0.04 0.90 0.02 -3.34 <0.01
Elementary Math GPT-4o AI free text vs. UserAI direct to ai 0.99 0.01 0.92 0.02 3.03 <0.01
Elementary Math GPT-4o AI free text vs. UserAI answer first 0.99 0.01 0.90 0.02 4.04 <0.01
Elementary Math GPT-4o User alone vs. UserAI direct to ai 0.78 0.03 0.92 0.02 -4.21 <0.01
Elementary Math GPT-4o User alone vs. UserAI answer first 0.78 0.03 0.90 0.02 -3.52 <0.01
High School Math GPT-4o AI letter zero shot vs. UserAI direct to ai 0.51 0.05 0.70 0.04 -3.20 <0.01
High School Math GPT-4o AI letter zero shot vs. UserAI answer first 0.51 0.05 0.73 0.03 -3.92 <0.01
High School Math GPT-4o AI letter few shot vs. UserAI direct to ai 0.49 0.04 0.70 0.04 -3.57 <0.01
High School Math GPT-4o AI letter few shot vs. UserAI answer first 0.49 0.04 0.73 0.03 -4.33 <0.01
High School Math GPT-4o AI free text vs. UserAI direct to ai 0.85 0.03 0.70 0.04 3.14 <0.01
High School Math GPT-4o AI free text vs. UserAI answer first 0.85 0.03 0.73 0.03 2.73 <0.01
High School Math GPT-4o User alone vs. UserAI direct to ai 0.41 0.03 0.70 0.04 -5.88 <0.01
High School Math GPT-4o User alone vs. UserAI answer first 0.41 0.03 0.73 0.03 -7.03 <0.01
College Math GPT-4o AI letter zero shot vs. UserAI direct to ai 0.45 0.07 0.52 0.08 -0.61 0.54
College Math GPT-4o AI letter zero shot vs. UserAI answer first 0.45 0.07 0.52 0.07 -0.72 0.47
College Math GPT-4o AI letter few shot vs. UserAI direct to ai 0.44 0.07 0.52 0.08 -0.72 0.47
College Math GPT-4o AI letter few shot vs. UserAI answer first 0.44 0.07 0.52 0.07 -0.85 0.40
College Math GPT-4o AI free text vs. UserAI direct to ai 0.73 0.06 0.52 0.08 2.23 0.03
College Math GPT-4o AI free text vs. UserAI answer first 0.73 0.06 0.52 0.07 2.40 0.02
College Math GPT-4o User alone vs. UserAI direct to ai 0.28 0.04 0.52 0.08 -2.67 <0.01
College Math GPT-4o User alone vs. UserAI answer first 0.28 0.04 0.52 0.07 -3.10 <0.01
Conceptual Physics GPT-4o AI letter zero shot vs. UserAI direct to ai 0.91 0.03 0.84 0.03 1.74 0.08
Conceptual Physics GPT-4o AI letter zero shot vs. UserAI answer first 0.91 0.03 0.84 0.03 1.70 0.09
Conceptual Physics GPT-4o AI letter few shot vs. UserAI direct to ai 0.96 0.02 0.84 0.03 3.22 <0.01
Conceptual Physics GPT-4o AI letter few shot vs. UserAI answer first 0.96 0.02 0.84 0.03 3.22 <0.01
Conceptual Physics GPT-4o AI free text vs. UserAI direct to ai 0.97 0.02 0.84 0.03 3.62 <0.01
Conceptual Physics GPT-4o AI free text vs. UserAI answer first 0.97 0.02 0.84 0.03 3.63 <0.01
Conceptual Physics GPT-4o User alone vs. UserAI direct to ai 0.55 0.03 0.84 0.03 -6.48 <0.01
Conceptual Physics GPT-4o User alone vs. UserAI answer first 0.55 0.03 0.84 0.03 -6.69 <0.01
Moral Scenarios GPT-4o AI letter zero shot vs. UserAI direct to ai 0.71 0.05 0.79 0.03 -1.47 0.14
Moral Scenarios GPT-4o AI letter zero shot vs. UserAI answer first 0.71 0.05 0.78 0.04 -1.13 0.26
Moral Scenarios GPT-4o AI letter few shot vs. UserAI direct to ai 0.80 0.04 0.79 0.03 0.27 0.79
Moral Scenarios GPT-4o AI letter few shot vs. UserAI answer first 0.80 0.04 0.78 0.04 0.49 0.63
Moral Scenarios GPT-4o AI free text vs. UserAI direct to ai 0.72 0.05 0.79 0.03 -1.26 0.21
Moral Scenarios GPT-4o AI free text vs. UserAI answer first 0.72 0.05 0.78 0.04 -0.93 0.35
Moral Scenarios GPT-4o User alone vs. UserAI direct to ai 0.73 0.03 0.79 0.03 -1.54 0.12
Moral Scenarios GPT-4o User alone vs. UserAI answer first 0.73 0.03 0.78 0.04 -1.05 0.29

Table A1: Results per dataset for GPT-4o, including AI-alone vs. user-AI comparisons and user-alone vs. user-AI comparisons.
Dataset Model Comparison Acc1 SE1 Acc2 SE2 z-value p-value
Elementary Math Llama-3.1-8b AI letter zero shot vs. UserAI direct to ai 0.45 0.04 0.86 0.03 -8.58 <0.01
Elementary Math Llama-3.1-8b AI letter zero shot vs. UserAI answer first 0.45 0.04 0.90 0.02 -10.50 <0.01
Elementary Math Llama-3.1-8b AI letter few shot vs. UserAI direct to ai 0.43 0.03 0.86 0.03 -9.39 <0.01
Elementary Math Llama-3.1-8b AI letter few shot vs. UserAI answer first 0.43 0.03 0.90 0.02 -11.53 <0.01
Elementary Math Llama-3.1-8b AI free text vs. UserAI direct to ai 0.88 0.03 0.86 0.03 0.56 0.58
Elementary Math Llama-3.1-8b AI free text vs. UserAI answer first 0.88 0.03 0.90 0.02 -0.65 0.51
Elementary Math Llama-3.1-8b User alone vs. UserAI direct to ai 0.81 0.03 0.86 0.03 -1.26 0.21
Elementary Math Llama-3.1-8b User alone vs. UserAI answer first 0.81 0.03 0.90 0.02 -2.70 <0.01
High School Math Llama-3.1-8b AI letter zero shot vs. UserAI direct to ai 0.32 0.03 0.62 0.04 -6.14 <0.01
High School Math Llama-3.1-8b AI letter zero shot vs. UserAI answer first 0.32 0.03 0.64 0.04 -6.89 <0.01
High School Math Llama-3.1-8b AI letter few shot vs. UserAI direct to ai 0.30 0.02 0.62 0.04 -7.09 <0.01
High School Math Llama-3.1-8b AI letter few shot vs. UserAI answer first 0.30 0.02 0.64 0.04 -7.98 <0.01
High School Math Llama-3.1-8b AI free text vs. UserAI direct to ai 0.64 0.04 0.62 0.04 0.24 0.81
High School Math Llama-3.1-8b AI free text vs. UserAI answer first 0.64 0.04 0.64 0.04 -0.16 0.87
High School Math Llama-3.1-8b User alone vs. UserAI direct to ai 0.45 0.03 0.62 0.04 -3.37 <0.01
High School Math Llama-3.1-8b User alone vs. UserAI answer first 0.45 0.03 0.64 0.04 -3.93 <0.01
College Math Llama-3.1-8b AI letter zero shot vs. UserAI direct to ai 0.35 0.04 0.46 0.07 -1.37 0.17
College Math Llama-3.1-8b AI letter zero shot vs. UserAI answer first 0.35 0.04 0.48 0.07 -1.56 0.12
College Math Llama-3.1-8b AI letter few shot vs. UserAI direct to ai 0.30 0.04 0.46 0.07 -1.97 0.05
College Math Llama-3.1-8b AI letter few shot vs. UserAI answer first 0.30 0.04 0.48 0.07 -2.18 0.03
College Math Llama-3.1-8b AI free text vs. UserAI direct to ai 0.41 0.05 0.46 0.07 -0.57 0.57
College Math Llama-3.1-8b AI free text vs. UserAI answer first 0.41 0.05 0.48 0.07 -0.74 0.46
College Math Llama-3.1-8b User alone vs. UserAI direct to ai 0.40 0.04 0.46 0.07 -0.75 0.46
College Math Llama-3.1-8b User alone vs. UserAI answer first 0.40 0.04 0.48 0.07 -0.93 0.35
Conceptual Physics Llama-3.1-8b AI letter zero shot vs. UserAI direct to ai 0.53 0.05 0.67 0.04 -2.25 0.02
Conceptual Physics Llama-3.1-8b AI letter zero shot vs. UserAI answer first 0.53 0.05 0.73 0.04 -3.22 <0.01
Conceptual Physics Llama-3.1-8b AI letter few shot vs. UserAI direct to ai 0.57 0.04 0.67 0.04 -1.64 0.10
Conceptual Physics Llama-3.1-8b AI letter few shot vs. UserAI answer first 0.57 0.04 0.73 0.04 -2.70 <0.01
Conceptual Physics Llama-3.1-8b AI free text vs. UserAI direct to ai 0.62 0.04 0.67 0.04 -0.77 0.44
Conceptual Physics Llama-3.1-8b AI free text vs. UserAI answer first 0.62 0.04 0.73 0.04 -1.80 0.07
Conceptual Physics Llama-3.1-8b User alone vs. UserAI direct to ai 0.46 0.03 0.67 0.04 -3.91 <0.01
Conceptual Physics Llama-3.1-8b User alone vs. UserAI answer first 0.46 0.03 0.73 0.04 -4.97 <0.01
Moral Scenarios Llama-3.1-8b AI letter zero shot vs. UserAI direct to ai 0.40 0.03 0.72 0.04 -6.01 <0.01
Moral Scenarios Llama-3.1-8b AI letter zero shot vs. UserAI answer first 0.40 0.03 0.74 0.04 -7.42 <0.01
Moral Scenarios Llama-3.1-8b AI letter few shot vs. UserAI direct to ai 0.31 0.03 0.72 0.04 -7.35 <0.01
Moral Scenarios Llama-3.1-8b AI letter few shot vs. UserAI answer first 0.31 0.03 0.74 0.04 -8.86 <0.01
Moral Scenarios Llama-3.1-8b AI free text vs. UserAI direct to ai 0.49 0.03 0.72 0.04 -4.07 <0.01
Moral Scenarios Llama-3.1-8b AI free text vs. UserAI answer first 0.49 0.03 0.74 0.04 -5.15 <0.01
Moral Scenarios Llama-3.1-8b User alone vs. UserAI direct to ai 0.79 0.03 0.72 0.04 1.34 0.18
Moral Scenarios Llama-3.1-8b User alone vs. UserAI answer first 0.79 0.03 0.74 0.04 1.00 0.32

Table A2: Results per dataset for Llama-3.1-8b, including AI-alone vs. user-AI comparisons and user-alone vs. user-AI comparisons.
Model Dataset Condition Answer Type Count
GPT-4o College Math answer-first userAIAnswer 134
userAnswer 283
direct-to-AI userAIAnswer 116
userAnswer 121
Conceptual Physics answer-first userAIAnswer 317
userAnswer 425
direct-to-AI userAIAnswer 351
userAnswer 117
Elementary Math answer-first userAIAnswer 542
userAnswer 697
direct-to-AI userAIAnswer 462
userAnswer 122
High School Math answer-first userAIAnswer 539
userAnswer 689
direct-to-AI userAIAnswer 463
userAnswer 122
Moral Scenarios answer-first userAIAnswer 242
userAnswer 331
direct-to-AI userAIAnswer 398
userAnswer 135
Llama-3.1-8b College Math answer-first userAIAnswer 119
userAnswer 251
direct-to-AI userAIAnswer 115
userAnswer 123
Conceptual Physics answer-first userAIAnswer 315
userAnswer 428
direct-to-AI userAIAnswer 333
userAnswer 112
Elementary Math answer-first userAIAnswer 485
userAnswer 620
direct-to-AI userAIAnswer 462
userAnswer 123
High School Math answer-first userAIAnswer 477
userAnswer 610
direct-to-AI userAIAnswer 464
userAnswer 125
Moral Scenarios answer-first userAIAnswer 349
userAnswer 471
direct-to-AI userAIAnswer 229
userAnswer 81

Table A3: Dataset statistics for ChatBench.
Figure A2: Consent page. Parts are redacted to remain anonymous.

Figure A3: Introduction page. Explains the task to users and ensures that they do not consult external tools.

Figure A4: Phase 1 Tutorial. Provides an example of a Phase 1 question before the user begins Phase 1.

Figure A5: Confidence page. Included per-question in both phases before the user tries to answer each question.

Figure A6: User-alone page. Users answer all questions on their own in Phase 1 and, if they are in the answer-first condition, answer each question in Phase 2 on their own first before answering with AI.

Figure A7: Phase 2 Instructions. Explains to users what they can expect in Phase 2 and reminds them not to use external tools. This screenshot shows instructions for a user in the answer-first condition. Users in the direct-to-AI condition see similar instructions, but without Step 3.

Figure A8: Phase 2 Tutorial. Provides an example of a Phase 2 question before the user begins Phase 2.

Figure A9: Feedback page. Final page of the task, where users leave free-text feedback to various questions.
B Details on Analyses and Experiments

We download the MMLU datasets (https://huggingface.co/datasets/cais/mmlu) (Hendrycks et al., 2021) and MMLU-Redux datasets (https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0) (Gema et al., 2024) from Hugging Face. The datasets are protected by the MIT and CC-BY-4.0 licenses, respectively, allowing our use of this data in our research. Our code is available at https://github.com/serinachang5/interactive-eval.

B.1 AI-Alone experiments

Here we provide the exact prompts used for each of the AI-alone methods: few-shot letter-only (Listing 1), zero-shot letter-only (Listing 2), and the two prompts for free-text (Listing 3 and Listing 4). For all the methods, the system prompt was "You are a helpful AI assistant."

Listing 1: Prompt for few-shot letter-only, taken from HELM. In-context examples are the five examples in MMLU's "dev" set for this dataset.

Answer with only a single letter.

The following are multiple choice questions (with answers) about {dataset}.
{example_1}
A. {example_1_option_A}
B. {example_1_option_B}
C. {example_1_option_C}
D. {example_1_option_D}
Answer: {example_1_answer}
...
{example_5}
A. {example_5_option_A}
B. {example_5_option_B}
C. {example_5_option_C}
D. {example_5_option_D}
Answer: {example_5_answer}
{question}
A. {option_A}
B. {option_B}
C. {option_C}
D. {option_D}
Answer:

Listing 2: Prompt for zero-shot letter-only, using the same language as few-shot but dropping the in-context examples.

Answer with only a single letter.

{question}
A. {option_A}
B. {option_B}
C. {option_C}
D. {option_D}
Answer:

Listing 3: First prompt for AI-alone free-text. This prompt, used to generate the model's free-text response, is simply the question and answer options concatenated.

{question}
A. {option_A}
B. {option_B}
C. {option_C}
D. {option_D}

Listing 4: Second prompt for AI-alone free-text. This second prompt instructs GPT-4o to extract an answer (if any) from the model's free-text response. In order not to bias the answer extraction, we do not include the correct answer in this prompt.

Here is a question that someone was asked:
================================================
{question}
A. {option_A}
B. {option_B}
C. {option_C}
D. {option_D}
================================================
Here is a response:
================================================
{response}
================================================
Did the response provide a final answer to the question? Respond with a JSON object that contains one key "attempted_answer" with a value that is true or false. If "attempted_answer" is true, then include a second key "answer_val" with the final answer's value in quotations. If the final answer value matches one of the answer options, include a third key "answer_letter" with a value that is one of the letters "A", "B", "C", or "D".

In ChatBench, we include the results of our AI-alone experiments, where we tested each of the two models (GPT-4o and Llama-3.1-8b) 50 times per question and AI-alone method. Testing 50 times was necessary since we used a temperature of 0.7, as discussed in the main text. We were able to get 50 answers for almost every model, question, and method, barring a few exceptions. For the letter-only methods, the model would occasionally not return a valid answer, since its response would begin with a character besides "A", "B", "C", or "D". Thus, we computed two accuracies: one where the invalid answers were treated as incorrect (since the model failed to follow instructions) and one where we computed accuracy over only the valid answers. We report the former accuracy in the paper, but report both types of accuracies in ChatBench. As expected, invalid answers were more common with zero-shot than few-shot, but they were a minor occurrence overall: below 5% of answers were invalid for 90% of questions with zero-shot and 99.6% of questions with few-shot. Invalid answers were not an issue for free-text, but we very occasionally had issues with the answer extraction step (e.g., errors in JSON parsing), resulting in losing one or two out of 50 answers for a few questions. For one of the Moral Scenarios questions, we had issues generating free-text responses, since it violated our model deployment's filtering policy.
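Putting Listings 3 and 4 together, one run of the free-text AI-alone method can be sketched as follows. This is an illustrative sketch rather than our released code: the client setup and deployment names are placeholders, and the extraction prompt is abbreviated (Listing 4 gives the full wording).

import json
from openai import OpenAI  # any chat-completions client with this interface works

EXTRACTION_TEMPLATE = (
    "Here is a question that someone was asked:\n{question_block}\n"
    "Here is a response:\n{response}\n"
    "Did the response provide a final answer to the question? ..."  # abbreviated; see Listing 4
)

def free_text_run(client, eval_model, question_block, correct_letter, temperature=0.7):
    # Step 1 (Listing 3): prompt the evaluated model with the question and options.
    response = client.chat.completions.create(
        model=eval_model,  # e.g., a GPT-4o or Llama-3.1-8b deployment (placeholder name)
        temperature=temperature,
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": question_block},
        ],
    ).choices[0].message.content

    # Step 2 (Listing 4): ask GPT-4o to extract an answer from the free-text response.
    extraction = client.chat.completions.create(
        model="gpt-4o",  # placeholder deployment name
        messages=[{"role": "user", "content": EXTRACTION_TEMPLATE.format(
            question_block=question_block, response=response)}],
    ).choices[0].message.content

    parsed = json.loads(extraction)  # may raise on rare malformed JSON, as noted above
    if parsed.get("attempted_answer") and parsed.get("answer_letter"):
        return parsed["answer_letter"] == correct_letter
    return None  # no valid answer extracted

Repeating this 50 times per question and averaging the outcomes, with runs that yield no valid answer treated as incorrect, matches the accuracy definition reported in the paper.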
B.2 Statistical details

Mean accuracies. When measuring accuracies for all methods (user-alone, AI-alone, and user-AI), we first compute per-question accuracies as the fraction of correct answers over the total number of answers $n_q$ for each question, denoted $\hat{p}_q$. We also compute the standard error for each question-level accuracy estimate, $SE_q = \sqrt{\hat{p}_q (1 - \hat{p}_q)/n_q}$. We then compute dataset-level accuracies with an (unweighted) average across all $Q$ question-level accuracies, and dataset-level standard errors using a decomposition of total variance to account for both variability in sampling questions from the larger population of MMLU questions and variability in correctness of responses (Miller, 2024):

$$SE_{tot} = \sqrt{\left(\mathbb{E}[SE_q] + \mathrm{Var}(\hat{p}_q)\right)/Q}. \qquad (1)$$

In Tables A1 and A2, we report mean accuracies for all datasets, models, AI-alone methods, and user-AI conditions. We also compare accuracies between two methods, for AI-alone vs. user-AI and for user-alone vs. user-AI. We conduct z-tests for all statistical tests comparing accuracies between two methods, where

$$z = (\hat{p}_1 - \hat{p}_2)\big/\sqrt{SE_1^2 + SE_2^2}. \qquad (2)$$
or other ?
Upper-bound on correlation. Since there is 3. Is the first AI answer : correct , incorrect , or not
yet providing an answer ?
noise in our estimate of user-AI accuracy per ques- 4. Is the last AI answer : correct , incorrect , or not yet
providing an answer ?
tion, we want to check if the low correlations be- 5. Are there more than one AI answer ?
5. Anywhere in the course of the conversation , does AI
tween user-AI and AI-alone accuracies can be ex- correct the user ?
plained by that noise. To test this, we simulate an 6. Anywhere in the course of the conversation , does the
user correct the AI by selecting a different answer
upper bound on what the correlation would be if the than what the AI recommended ?
7. Anything unusual or interesting about this
user-AI accuracies were drawn from the same dis- interaction that you noticed ?

tribution as the AI-alone accuracies, which we as- Here is the question the user was given :
{ question }
sume are perfectly estimated because we test each
Here are the answer choices to the question :
LLM 50 times on each question. We construct hy- { choices }

pothetical user-AI data, where for each question q, Here is the correct answer to the question :
{ correctAnswer }
we draw x from Binomial(nqu , pqf ), where nqu is
Here is the conversation between the user (" You ") and
the number of user-AI answers we had in our study, the AI (" Bot ") :
{ conversation }
pqf is the free-text accuracy on this question, and
Here is what the user selected as their answer :
x/nqu is now the hypothetical user-AI accuracy on { selectedAnswer }

this question. Over 100 iterations, the hypothetical


correlation with free-text ranges from r = 0.88 We used the structured data from this analysis
to 0.92 for direct-to-AI and r = 0.90 to 0.94 for for a number of results, described in Section 4.3,
answer-first, with the real correlations of 0.45 and such as how often user-AI interactions “mirror” AI
0.46 (respectively) well below this range. benchmarks and how often AI provided the correct
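This simulation amounts to a few lines of numpy; the sketch below uses our own variable names and an arbitrary seed, and is not the exact script we ran.

import numpy as np

def simulated_upper_bound(n_user_ai, p_free_text, n_iters=100, seed=0):
    # n_user_ai[q]: number of user-AI answers collected for question q
    # p_free_text[q]: AI-alone free-text accuracy for question q (treated as exact)
    rng = np.random.default_rng(seed)
    correlations = []
    for _ in range(n_iters):
        x = rng.binomial(n_user_ai, p_free_text)   # hypothetical correct counts per question
        hypothetical_acc = x / n_user_ai
        correlations.append(np.corrcoef(p_free_text, hypothetical_acc)[0, 1])
    return min(correlations), max(correlations)    # range of Pearson r over iterations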
B.3 Additional analyses from user study

User-AI conversations. To characterize the conversations between users and LLMs, we set a system prompt of "You are an expert at extracting information from conversations and MUST return a JSON object." and used the prompt in Listing 5.

Figure B1: Fraction of user-AI interactions where the last AI answer in the conversation is wrong but the user still answered correctly, by subject and model.

Listing 5: Prompt to GPT-4o for automatically characterizing user-AI conversations.

The following conversation occurred between a user called "You" and an AI called "Bot", in which "You" tried to use "Bot" to answer a question. Extract the following information about this conversation.
1. Is the first substantive user prompt a: question, statement, or other?
2. Is the first user question: a near-"exact" rephrasing of the question, a component of the question, an erroneous or misinterpreted aspect of the question, or other?
3. Is the first AI answer: correct, incorrect, or not yet providing an answer?
4. Is the last AI answer: correct, incorrect, or not yet providing an answer?
5. Are there more than one AI answer?
5. Anywhere in the course of the conversation, does AI correct the user?
6. Anywhere in the course of the conversation, does the user correct the AI by selecting a different answer than what the AI recommended?
7. Anything unusual or interesting about this interaction that you noticed?

Here is the question the user was given:
{question}

Here are the answer choices to the question:
{choices}

Here is the correct answer to the question:
{correctAnswer}

Here is the conversation between the user ("You") and the AI ("Bot"):
{conversation}

Here is what the user selected as their answer:
{selectedAnswer}

We used the structured data from this analysis for a number of results, described in Section 4.3, such as how often user-AI interactions "mirror" AI benchmarks and how often the AI provided the correct answer in the user-AI conversations. We also used this structured data to measure how often the user corrects the AI model's mistake, by computing the fraction of user-AI interactions where the last AI answer in the conversation is wrong but the user still answered correctly (Figure B1). We find that users are much likelier to correct Llama-3.1-8b than GPT-4o, which helps to explain how some of the gap in the model's AI-alone performance is closed in the user-AI setting.
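Given the structured JSON extracted with Listing 5, the Figure B1 quantity is a simple aggregation. The column names in this pandas sketch (last_ai_answer, user_correct, subject, model) are hypothetical stand-ins for the fields we derived from the extraction.

import pandas as pd

def user_corrects_ai_rate(df):
    # df: one row per user-AI interaction
    # last_ai_answer: "correct", "incorrect", or "no answer" (from the Listing 5 extraction)
    # user_correct: whether the user's selected answer was correct
    corrected = (df["last_ai_answer"] == "incorrect") & df["user_correct"]
    # Fraction of all interactions where the last AI answer was wrong but the user was right,
    # broken down by subject and model as in Figure B1.
    return df.assign(corrected=corrected).groupby(["subject", "model"])["corrected"].mean()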
User confidence. In Figure B2, we visualize the relationship between user-reported confidence per question and user-alone accuracy. First, over our five datasets, we find that users are most confident about Moral Scenarios, followed by Elementary Math, Conceptual Physics, High School Math, and College Math. The user selects their confidence from three options (as shown in Figure A5): "not confident", "somewhat confident", and "very confident". We find that users are well-calibrated within a dataset: as their confidence increases, so does the mean accuracy. Users are less well-calibrated across datasets: for example, users who are very confident on a Conceptual Physics question slightly underperform those who are only somewhat confident on an Elementary Mathematics question.

B.4 Simulator details

Below we provide the exact prompts for the two-step simulator (Listings 6-8) and the IQA-EVAL simulator from Li et al. (2024a) (Listing 9).

Listing 6: Two-step user simulator, system prompt for both tasks.

You are a human user interacting with an AI system, and you are trying to answer the following question:
{question}
A. {option_A}
B. {option_B}
C. {option_C}
D. {option_D}

Listing 7: Two-step user simulator, user prompt for Task 1 (user refers to the role in the OpenAI API, not a real user).

Generate the first prompt you would say to the system to get started with answering your question. Remember to write exactly as a real user would.

Listing 8: Two-step user simulator, user prompt for Task 2 (user refers to the role in the OpenAI API, not a real user).

Here is your conversation so far with the AI system:
========================
YOU: {simulator prompt 1}
SYSTEM: {AI system response 1}
...
YOU: {simulator prompt k}
SYSTEM: {AI system response k}
========================
If your question is answered by this conversation, return ONLY the answer in the format "Answer: A, B, C, or D". If not, generate the next prompt you would say to the system to answer your question. Remember to keep your writing style consistent.

Listing 9: IQA-EVAL simulator, which only has a system prompt, following the original implementation.

You are mimicking a human.
You are trying to choose the correct answer to the given question.
Please ask an assistant sub-questions for help approaching answers.
In each turn, please only ask one sub-question to interact with an assistant. In the sub-questions, please include all necessary information, such as the question and options, in the original question. If you know the answer, please output "So, the answer is: A, B, C, or D."
{question}
A. {option_A}
B. {option_B}
C. {option_C}
D. {option_D}
YOU: {simulator prompt 1}
SYSTEM: {AI system response 1}
...
YOU: {simulator prompt k}
SYSTEM: {AI system response k}

In our simulator experiments, we fine-tune GPT-4o using the Azure OpenAI Service. We use the default hyperparameters, with a batch size of 11 and 2 epochs. The training data contains 8,538 training examples (we describe in Section 5 how each user-AI conversation with k user utterances becomes k+1 training examples for fine-tuning).
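To make the fine-tuning data construction concrete, the sketch below shows one plausible way a conversation with k user utterances yields k+1 chat-format training examples: one Task 1 example that targets the user's first prompt, and k Task 2 examples that target each subsequent prompt or, in the final example, the user's selected answer in the "Answer: X" format of Listing 8. The exact construction is defined in Section 5; the arguments here (system_prompt, task1_prompt, task2_prompt_fn) are placeholders standing in for the Listing 6-8 templates.

def conversation_to_examples(system_prompt, task1_prompt, task2_prompt_fn,
                             user_turns, ai_turns, selected_answer):
    # user_turns[i] and ai_turns[i] are the i-th user utterance and AI response;
    # selected_answer is the letter the user ultimately chose.
    examples = []

    # Example 0 (Task 1, Listing 7): generate the first user prompt from scratch.
    examples.append({"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task1_prompt},
        {"role": "assistant", "content": user_turns[0]},
    ]})

    # Examples 1..k (Task 2, Listing 8): given the conversation so far, produce the next
    # user prompt, or the final answer once the conversation is complete.
    k = len(user_turns)
    for i in range(1, k + 1):
        history = task2_prompt_fn(user_turns[:i], ai_turns[:i])
        target = user_turns[i] if i < k else f"Answer: {selected_answer}"
        examples.append({"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": history},
            {"role": "assistant", "content": target},
        ]})
    return examples  # k + 1 examples in the chat format expected for GPT-4o fine-tuning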
Figure B2: Distribution of confidence answers from users and mean user-alone accuracies per confidence answer.

Figure B3: Scatter plot comparing different AI-alone and user simulator methods' abilities to predict user-AI accuracy, where the AI system is GPT-4o. Pearson correlations are included in the plot titles.

Figure B4: Scatter plot comparing different AI-alone and user simulator methods' abilities to predict user-AI accuracy, where the AI system is Llama-3.1-8b. Pearson correlations are included in the plot titles.