
Don’t Hallucinate, Abstain:

Identifying LLM Knowledge Gaps via Multi-LLM Collaboration

Shangbin Feng1  Weijia Shi1  Yike Wang2
Wenxuan Ding3  Vidhisha Balachandran4  Yulia Tsvetkov1
1University of Washington  2University of California, Berkeley
3The Hong Kong University of Science and Technology  4Carnegie Mellon University
[email protected]

Abstract

Despite efforts to expand the knowledge of large language models (LLMs), knowledge gaps—missing or outdated information in LLMs—might always persist given the evolving nature of knowledge. In this work, we study approaches to identify LLM knowledge gaps and abstain from answering questions when knowledge gaps are present. We first adapt existing approaches to model calibration or adaptation through fine-tuning/prompting and analyze their ability to abstain from generating low-confidence outputs. Motivated by their failures in self-reflection and over-reliance on held-out sets, we propose two novel approaches that are based on model collaboration, i.e., LLMs probing other LLMs for knowledge gaps, either cooperatively or competitively. Extensive experiments with three LLMs on four QA tasks featuring diverse knowledge domains demonstrate that both cooperative and competitive approaches to unveiling LLM knowledge gaps achieve up to 19.3% improvements on abstain accuracy against the strongest baseline. Further analysis reveals that our abstention methods pinpoint failure cases in retrieval augmentation and knowledge gaps in multi-hop reasoning.[1]

[1] Code and data will be publicly available at https://github.com/BunsenFeng/AbstainQA.

1 Introduction

LLMs demonstrate impressive capabilities of encoding real-world knowledge in model parameters and leveraging it to aid knowledge-intensive tasks (Petroni et al., 2019; Brown et al., 2020; Yu et al., 2023a). But when such knowledge is missing or unreliable, they resort to hallucinations (Ji et al., 2023) and biases (Feng et al., 2023a), while still "speaking with confidence." A growing body of work seeks to expand LLM knowledge through retrieval augmentation (Guu et al., 2020; Borgeaud et al., 2022; Khattab et al., 2022; Shi et al., 2023; Chen et al., 2023), search engine integration (Nakano et al., 2021; Press et al., 2023), and multi-LM collaboration (Luo et al., 2023; Feng et al., 2023b). However, LLM knowledge gaps might always persist due to the ever-evolving nature of knowledge (Kandpal et al., 2023; Mallen et al., 2023; De Cao et al., 2021; Hernandez et al., 2023; Kasai et al., 2024). Consequently, we posit that abstaining from generating low-confidence outputs should be a part of LLMs' functionality, and ask a crucial research question: how to identify knowledge gaps in LLMs? Developing and evaluating robust mechanisms to address the abstain problem improves LLM reliability, reduces hallucinations, and mitigates biases due to model uncertainty.

We hypothesize that there are three ways to operationalize the abstain problem: (1) from the data perspective, LLMs should abstain when knowledge is missing in training data or curated from unreliable sources; (2) from the modeling perspective, LLMs should abstain when knowledge is not encoded in model parameters and cannot be inferred; (3) from the usage perspective, LLMs should abstain when generated outputs would provide an incorrect answer. Since users are directly interacting with and impacted by LLM-generated texts, not model parameters or training data, we formulate and evaluate LLM abstention focusing on the factual correctness of LLM generations.

Specifically, we focus on abstaining in question-answering (AbstainQA): given an LLM and a question q, we aim to develop an abstain function f(LLM, q) → {true, false} to indicate whether the LLMs should abstain based on the limitations of their internal knowledge (Kamath et al., 2020; Jiang et al., 2021; Whitehead et al., 2022). To facilitate a comprehensive understanding of existing approaches that can be employed for LLM abstention, we first identify 11 methods from previous works and adapt them to incorporate the abstain functionality. We categorize these baselines into (1) calibration-based, such as employing temperature
scaling (Guo et al., 2017; Jiang et al., 2021) to calibrate models and set a confidence score threshold for abstention; (2) training-based, such as employing linear probing on LLMs' hidden layers to evaluate the veracity of generated texts (Slobodkin et al., 2023; Azaria and Mitchell, 2023); (3) prompting-based, such as prompting LLMs to request more information before answering questions (Feng et al., 2023b); and (4) self-consistency based, such as generating multiple chains-of-thought (Wei et al., 2022).

Among the 11 baselines, a majority and often the stronger approaches would require a held-out set for training and hyperparameter tuning, potentially harming generalization across knowledge domains. In addition, these approaches often rely on a "self-reflection" assumption that a single LLM could be employed or adapted to evaluate its own generated texts; however, challenges such as hallucination and confirmation biases cast doubt on this assumption (Ji et al., 2023; Xie et al., 2023).

We propose multi-LLM collaboration-based approaches, enabling robust LLM abstention through multi-LLM collaboration to reflect on generated text in cooperative or competitive settings while removing the need for held-out sets. For multiple LLMs to work in cooperation (COOPERATE), the LLM employs other models to provide feedback on the proposed answer and reasoning, and synthesizes the outputs into an overall abstain decision. For a competitive setting (COMPETE), the LLM is challenged by other LLMs with conflicting knowledge, producing an abstain decision based on whether it continues to prefer the proposed answer in the presence of conflicting evidence.

We evaluate the baselines and proposed collaboration approaches for abstention with three LLMs, on four knowledge-intensive QA tasks spanning diverse knowledge domains and reasoning. Extensive experiments demonstrate that COOPERATE and COMPETE are strong abstain mechanisms that outperform all baselines in 9 of the 12 settings across tasks and models, achieving an improvement of up to 19.3% in abstain accuracy. Further analysis reveals that the proposed collaboration-based approaches could help identify failure cases of retrieval augmentation, pinpoint knowledge gaps in multi-hop reasoning, and more. Our contributions are (1) a critical evaluation and typology of diverse existing methods, repurposed to identify knowledge gaps in LLMs, as well as (2) two novel, robust multi-LLM collaboration methods to detect LLM knowledge gaps, COOPERATE and COMPETE.

2 Identifying Knowledge Gaps in LLMs

We propose to repurpose existing approaches to enable abstain functionality in the face of knowledge gaps in LLMs. We conceptually categorize these approaches into calibration (§2.1), training (§2.2), prompting (§2.3), and self-consistency (§2.4) methods. Empirically observing that these approaches are dependent on external knowledge/data sets and often suffer from instability, we propose a new category based on multi-LLM collaboration (§2.5).

2.1 Calibration-Based

Token Probability LLMs are often inherently calibrated to different extents (Radford et al., 2019; Liang et al., 2023) and token probabilities might be employed off-the-shelf. Let (q, ā) ∈ H be a held-out set of questions and gold answers for hyperparameter tuning, while the LLM produces an answer a = LLM(q) with a confidence score based on token probabilities p(a) ∈ [0, 1]. We set a threshold p∗ by minimizing abstain errors on H, formally:

p∗ = arg min_{p′ ∈ [0,1]} [ Σ_{a=ā} 1(p(a) < p′) + Σ_{a≠ā} 1(p(a) ≥ p′) ]

where 1 denotes the indicator function. At inference time, the LLM abstains when p(a) < p∗.

Temperature Scaling Before setting a threshold p∗ on confidence scores, temperature scaling (Guo et al., 2017; Jiang et al., 2021) first reshapes the token probability distributions by introducing a scalar temperature τ > 0. Given an answer probability distribution a, we transform it into softmax(a/τ), then optimize τ on the held-out set H with cross-entropy loss to obtain τ∗. A probability threshold p∗ is then set over {softmax(LLM(q)/τ∗)}_{q ∈ H} and applied at inference time.

Ask for Calibration In addition to token probabilities, LLMs could also verbalize confidence scores: we follow Tian et al. (2023) to obtain confidence scores p(a) by employing a two-step process: 1) generating an answer with the LLM and 2) eliciting a confidence score by prompting with "Provide the probability that your guess is correct."[2] A threshold p∗ is similarly set over verbalized scores and applied at inference time.

[2] Full prompt templates are in Appendix B.
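The calibration-based variants above share one step: fitting the threshold p∗ on the held-out set and then abstaining whenever confidence falls below it. The following is a minimal sketch of that step, not the paper's released code; the held-out (confidence, is_correct) pairs are assumed to come from whichever confidence source is used (token probabilities, temperature-scaled probabilities, or verbalized scores).

```python
from typing import Iterable, Tuple

def fit_abstain_threshold(held_out: Iterable[Tuple[float, bool]]) -> float:
    """Pick p* in [0, 1] minimizing abstain errors on a held-out set.

    Each item is (confidence, is_correct). An abstain error is either
    abstaining on a correct answer (confidence < p') or answering an
    incorrect one (confidence >= p').
    """
    points = sorted(held_out)
    # Candidate thresholds: 0, 1, and every observed confidence value.
    candidates = [0.0, 1.0] + [p for p, _ in points]
    best_threshold, best_errors = 0.0, float("inf")
    for p_prime in candidates:
        errors = sum(
            (conf < p_prime and correct) or (conf >= p_prime and not correct)
            for conf, correct in points
        )
        if errors < best_errors:
            best_threshold, best_errors = p_prime, errors
    return best_threshold

def should_abstain(confidence: float, threshold: float) -> bool:
    """Abstain whenever the model's confidence falls below the fitted p*."""
    return confidence < threshold

# Toy held-out confidences (invented for illustration):
held_out = [(0.92, True), (0.35, False), (0.81, True), (0.40, False), (0.55, True)]
p_star = fit_abstain_threshold(held_out)
print(p_star, should_abstain(0.3, p_star))  # e.g. 0.55, True
```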

[Figure 1: two panels. Cooperate (left): given a question and a proposed answer, the LLM (self mode) generates domain knowledge as factual/commonsense/math experts, or (others mode) separate LLMs each produce feedback ("Please review the proposed answer and provide feedback on its correctness."); a judge then decides "Based on the feedback, the proposed answer is: A. True B. False". Compete (right): other LLMs propose alternative answers ("Please propose an alternative answer.") and generate knowledge paragraphs about them; the LLM answers the question again with that knowledge prepended ("Answer the question with the knowledge: feel free to ignore irrelevant or wrong information.") and abstains if its new answers differ from the proposed answer.]
Figure 1: Overview of collaboration-based approaches for LLM abstention: COOPERATE and COMPETE.
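All of the mechanisms in this section, baselines and proposed methods alike, can be viewed as implementations of the abstain function f(LLM, q) → {true, false} from §1. As a shared frame of reference, here is a hypothetical minimal interface (not part of the paper's codebase) that the sketches later in this section implicitly follow.

```python
from typing import Protocol

class AbstainMechanism(Protocol):
    """Common interface: return True when the LLM should abstain on a question."""

    def should_abstain(self, question: str) -> bool:
        ...
```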

2.2 Training-Based

Hidden Layers Previous works demonstrate that probing LLMs' hidden layer representations might yield insights into their knowledge and factuality (Slobodkin et al., 2023; Azaria and Mitchell, 2023; CH-Wang et al., 2023). We extract the hidden embeddings e_q = LLM-hidden(q), then conduct linear probing to predict ℓ ∈ {accept, reject} through p(ℓ | q) ∝ exp(FFN(e_q)). We optimize the cross-entropy loss on the held-out set H. At inference time, the LLM should abstain if arg max_ℓ p(ℓ | q) indicates that the answer should be rejected.

External Verifier Training an external verifier to judge the veracity of LLM-generated text has shown great potential in math reasoning (Cobbe et al., 2021) and general knowledge domains (Kadavath et al., 2022; Cohen et al., 2023). We train an encoder-based language model LM-enc and employ the [CLS] token for classification. Formally, p(ℓ | q) ∝ exp(LM-enc(q; LLM(q))) is optimized on H with the cross-entropy loss and an abstain decision is achieved through arg max_ℓ p(ℓ | q) at inference time. Here both the question and the generated answer are employed to train the external verifier, different from the hidden layers approach where only the intermediate encodings of the question are used for linear probing.

Instruction Tuning A promising way to adapt LLMs is through tuning LLMs on instruction triplets in the format of {instruction|input|output} (Ouyang et al., 2022). Here we investigate whether abstaining could be baked into existing LLMs through instruction tuning. We first employ an instruction to encourage LLMs to abstain when necessary. Given the held-out set H, we first evaluate the correctness of the generated answer a = LLM(q): if it is correct, we use {instruction|q|a} as an instruction tuning instance; otherwise, we replace the incorrect answer a with an abstain message and train on {instruction|q|abstain message}. In this way, the LLM abstains if the abstain message is generated at inference time for a given q.
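As an illustration of the Hidden Layers baseline above, the sketch below fits a linear probe on hidden-state features to predict accept/reject. It is a simplified stand-in rather than the paper's implementation: a logistic-regression probe replaces the FFN probe, and the `features` and `labels` arrays are synthetic placeholders for a chosen layer's hidden states and the held-out correctness labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out data: one hidden-state vector per question, plus a
# binary label indicating whether the LLM answered that question correctly.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4096))   # e.g. last-token hidden states
labels = rng.integers(0, 2, size=200)     # 1 = answer was correct, 0 = incorrect

# Linear probe: p(accept | q) estimated from the hidden representation.
probe = LogisticRegression(max_iter=1000).fit(features, labels)

def should_abstain(hidden_state: np.ndarray) -> bool:
    """Abstain when the probe predicts the answer is more likely wrong."""
    p_accept = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    return p_accept < 0.5

print(should_abstain(rng.normal(size=4096)))
```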
2.3 Prompting-Based

Self-Reflect Previous studies show that LLMs might have preliminary capabilities of judging and evaluating their own answer (Kadavath et al., 2022). To this end, we prompt the LLM to self-reflect directly after its generated answer with "The above answer is: A. True B. False". LLMs should abstain when they deem the generated answer as false.

More Information Since the internal knowledge of LLMs is often incomplete and noisy, existing works explore how to selectively request external information to aid LLM generation (Xu et al., 2023; Asai et al., 2023; Feng et al., 2023b). Following Feng et al. (2023b), we append a prompt about whether more information is needed to answer a given question: "Do you need more information to answer this question? (Yes or No)". The LLM should abstain if more information is requested.

Generate and Match There might be gaps between LLM generation and "understanding" (West et al., 2023), which leads to hallucinations when questions are presented in a multiple-choice format (Fu et al., 2023). Instead of directly choosing from one of the options, we instruct the LLM to first generate an answer without access to potential options. We then prompt the LLM to evaluate whether the generated answer matches any of the options: the LLM should abstain if there's no match between the generated answer and options.

2.4 Consistency-Based

None-of-the-Above Previous works have investigated LLMs' robustness to possibilities of none-of-the-above (nota) in QA tasks (Kadavath et al., 2022; Ding et al., 2023). We propose an abstain mechanism by adding an extra nota option to every question: the LLM should abstain if nota is selected as the answer, indicating its low consistency in answering the question.

Self-Consistency Threshold Self-consistency (Wang et al., 2022) aims to generate multiple chain-of-thought (Wei et al., 2022) reasoning paths and conduct majority voting to determine the answer. We develop an abstain mechanism by proposing the plurality index: given a question q along with the k generated reasoning paths and answers {(a_i, p_i)}_{i=1,...,k}, the plurality index is defined as:

plu(LLM, q, k) = max_{a_i} Σ_{j=1,...,k} 1(a_j = a_i)

where it indicates how large the plurality is among all k reasoning paths. A threshold τ∗ ∈ [0, 1] is then optimized on H and applied at inference time to abstain when plu(LLM, q, k) < τ∗ · k.

2.5 Proposed: Multi-LLM Collaboration

Existing approaches often adapt an LLM to evaluate/reflect on its own generated texts, while challenges such as confirmation biases and hallucination would result in unreliable self-evaluation (Kadavath et al., 2022; Ji et al., 2023; Xie et al., 2023). In addition, the stronger approaches rely on held-out sets and result in weaker generalization across knowledge domains. We propose to enable robust LLM abstention through multi-LLM collaboration, i.e., employing multiple LLMs that interact with each other to evaluate the veracity of generated texts and make abstain decisions (Figure 1).

Cooperate LLMs often come with varying knowledge coverage with the potential of complementing each other (Yu et al., 2023a; Du et al., 2023; Bansal et al., 2023), while the internal knowledge of one LLM could also be diversified and specialized (Kang et al., 2023; Si et al., 2023). We tap into this knowledge variation in the COOPERATE approach by having different expert LLMs generate feedback on LLM-proposed answers and employing an LLM as the final judge to decide whether to abstain. Formally, given question q and answer a = LLM(q), we obtain a set of natural language feedback {f^1, ..., f^k} from expert LLMs and employ a judge LLM to summarize and synthesize the feedbacks into an abstain decision LLM(q, a, {f^1, ..., f^k}) → {accept, reject}.

We employ two modes to generate feedback: for self, the same LLM is specialized into experts on domains {d_1, ..., d_k} through prompting-based self-specialization: we prompt the LLM to generate a knowledge passage knowledge_i about q with a focus on domain d_i. A domain-specific feedback is then generated by prepending the knowledge passage: f^i = LLM(knowledge_i, q, a), and prompting the model to respond as a reviewer. For others, we use separate LLMs {LLM_1, ..., LLM_k} to provide feedback f^i = LLM_i(q, a), aiming to identify the knowledge gaps in each other and complement the abstain process.

Compete LLMs have varying preferences for knowledge conflicts, i.e. when there is a conflict between the LLM's internal parametric knowledge and external information provided in the prompt (Xie et al., 2023; Wang et al., 2023b). We hypothesize that an LLM should abstain when it is impacted by conflicting information generated by other LLMs and does not stick to the original answer. Concretely, for question q and LLM-generated answer a = LLM(q), we prompt other LLMs to generate k alternative answers {a′_1, ..., a′_k} along with a knowledge passage knowledge_i about a′_i. We then instruct the LLM to answer q again with conflicting information prepended: ã_i = LLM(knowledge_i, q). If a = ã_i, the LLM sticks to the original answer and should thus be accepted; otherwise, the LLM is swayed by conflicting information generated by other LLMs, betraying its low confidence. This process is repeated for k alternative answers and the LLM should abstain if the answer changes in a majority of cases.

3 Experiment Settings

Models We evaluate LLM abstention baselines and our proposed collaboration-based approaches with three LLMs featuring varying sizes and openness: Mistral-7B (Jiang et al., 2023), LLaMA2-70B (Touvron et al., 2023), and ChatGPT. We posit that a robust abstain mechanism should ideally work for all LLMs, weak and strong. We set the default sampling temperature to 0.1, and employ 0.7 where multiple runs are required.
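To ground the §2.5 procedures before turning to results, here is a minimal sketch of both collaboration modes. It is an illustration under stated assumptions, not the authors' released implementation: `generate(model, prompt)` is a hypothetical text-generation call, only the "others" mode of COOPERATE is shown, and the prompts are paraphrases of the Figure 1 prompts rather than the exact templates (which are in the paper's Appendix B).

```python
from typing import Callable, List

Generate = Callable[[str, str], str]  # (model_name, prompt) -> completion; hypothetical

def cooperate_abstain(generate: Generate, judge: str, reviewers: List[str],
                      question: str, answer: str) -> bool:
    """COOPERATE (others mode): reviewer LLMs critique the answer, a judge decides."""
    feedbacks = [
        generate(r, f"Question: {question}\nProposed answer: {answer}\n"
                    "Please review the proposed answer and provide feedback "
                    "on its correctness.")
        for r in reviewers
    ]
    verdict = generate(judge, "Feedback from reviewers:\n" + "\n".join(feedbacks) +
                              "\nBased on the feedback, is the proposed answer true "
                              "or false? Answer 'true' or 'false'.")
    return "false" in verdict.lower()  # abstain if the judge rejects the answer

def compete_abstain(generate: Generate, model: str, rivals: List[str],
                    question: str, answer: str) -> bool:
    """COMPETE: abstain if conflicting evidence from other LLMs sways the answer."""
    flipped = 0
    for rival in rivals:
        alternative = generate(rival, f"{question}\nPlease propose an alternative answer.")
        evidence = generate(rival, f"Generate a knowledge paragraph about: {alternative}")
        new_answer = generate(model, f"Knowledge: {evidence}\n{question}\n"
                                     "Answer the question; feel free to ignore "
                                     "irrelevant or wrong information.")
        if new_answer.strip() != answer.strip():
            flipped += 1
    return flipped > len(rivals) / 2  # abstain if swayed in a majority of cases
```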
MMLU K-Crosswords Hellaswag Propaganda
Method R-Acc ER A-Acc A-F1 R-Acc ER A-Acc A-F1 R-Acc ER A-Acc A-F1 R-Acc ER A-Acc A-F1
MISTRAL-7B
P ROBS .570 .109 .608 .456 .251 -.351 .397 .422 .456 -.041 .599 .659 .337 -.150 .590 .680
T EMP. .565 .104 .601 .426 .250 -.452 .303 .182 .451 -.047 .601 .659 .340 -.170 .585 .661
A SK C ALI . .648 .141 .639 .647 .233 -.023 .705 .825 .455 -.017 .616 .734 .231 -.035 .705 .824
H IDDEN .449 -.085 .420 .137 .104 -.788 .107 .009 .369 -.198 .424 .336 .223 -.510 .240 .084
V ERIFIER .568 .083 .586 .534 .208 -.083 .805 .889 .405 -.080 .550 .628 .448 -.015 .760 .853
I NSTRUCT .709 .198 .688 .693 - - - - .616 .075 .701 .771 - - - -
R EFLECT .498 -.002 .495 .477 .111 -.686 .203 .207 .371 -.150 .477 .500 .216 -.275 .490 .602
M ORE I NFO .504 .006 .500 .216 .272 -.455 .273 .003 .379 -.227 .396 .120 .219 -.540 .240 .073
G EN +M ATCH .511 .021 .515 .129 .111 -.694 .193 .188 .377 -.174 .458 .415 .226 -.545 .230 .013
NOTA .516 .029 .528 .163 .098 -.797 .102 .011 .371 -.244 .387 .105 .259 -.410 .340 .267
SC T HRES . .604 .140 .641 .551 .210 -.090 .793 .880 .493 -.004 .614 .713 .273 -.100 .685 .799
C OOP - SELF .571 .059 .564 .601 .260 -.437 .313 .311 .406 -.042 .601 .719 .297 -.095 .680 .797
C OOP - OTHERS .688 .213 .712 .692 .266 -.022 .761 .852 .626 .092 .725 .783 .182 -.140 .625 .757
C OMPETE .735 .140 .640 .700 .289 -.129 .597 .722 .573 .032 .658 .766 .302 -.055 .700 .805
LLAMA2-70B
A SK C ALI . .624 .025 .435 .568 - - - - .944 .032 .498 .649 - - - -
H IDDEN .473 -.018 .400 .446 .282 -.265 .423 .466 .507 .007 .497 .479 - - - -
V ERIFIER .665 .201 .609 .511 .443 -.056 .634 .694 .522 .009 .504 .617 .259 -.065 .755 .855
I NSTRUCT .745 .216 .628 .640 .288 .024 .606 .772 .475 -.008 .487 .615 - - - -
R EFLECT .616 .121 .529 .469 .383 -.155 .528 .537 .569 .076 .560 .530 .198 -.520 .275 .225
M ORE I NFO .590 .177 .590 .042 .306 -.382 .308 .020 .516 .032 .518 .044 - - - -
G EN +M ATCH .667 .050 .450 .560 .248 -.111 .573 .708 .484 -.004 .477 .614 .082 -.205 .620 .759
NOTA .592 .167 .583 .181 .323 -.295 .388 .280 .516 .027 .522 .236 .185 -.580 .225 .124
SC T HRES . .684 .247 .656 .534 .426 -.090 .590 .617 .667 .100 .590 .655 .412 -.030 .760 .852
C OOP - SELF .615 .150 .550 .400 .463 -.030 .640 .714 .649 .110 .600 .643 .222 -.250 .500 .615
C OOP - OTHERS .694 .262 .676 .562 .402 -.063 .636 .757 .700 .238 .704 .677 .329 -.125 .675 .774
C OMPETE .782 .148 .552 .608 .323 -.080 .642 .760 .611 .047 .525 .625 .161 -.210 .595 .729
CHATGPT
P ROBS .774 .421 .715 .457 .600 .187 .587 .122 .750 .278 .599 .476 .333 -.015 .625 .765
T EMP. .769 .419 .716 .452 .616 .214 .619 .216 .750 .278 .595 .468 .250 -.010 .630 .772
A SK C ALI . .694 .385 .690 .006 .601 .202 .601 .010 .672 .344 .672 .006 .444 -.015 .580 .712
V ERIFIER .788 .301 .599 .483 - - - - .667 .305 .634 .120 .441 -.035 .570 .672
I NSTRUCT .840 .422 .748 .579 .752 .310 .709 .627 .817 .429 .751 .614 .605 .045 .645 .744
R EFLECT .752 .336 .630 .411 .784 .239 .641 .633 .754 .377 .701 .487 .571 .015 .615 .742
M ORE I NFO .721 .246 .546 .390 .605 .145 .553 .380 .675 .224 .548 .339 .416 -.145 .470 .293
G EN +M ATCH .737 .350 .652 .383 .660 .083 .486 .550 .712 .182 .506 .447 .365 -.115 .490 .568
NOTA .719 .389 .692 .260 .644 .163 .565 .480 .689 .307 .628 .268 .400 -.120 .485 .488
SC T HRES . .766 .424 .743 .447 .637 .216 .622 .382 .749 .366 .688 .468 .399 -.160 .440 .309
C OOP - SELF .841 .436 .726 .578 .794 .175 .646 .646 .878 .344 .670 .628 .684 .070 .710 .802
C OOP - OTHERS .780 .362 .660 .479 .659 .109 .509 .536 .790 .350 .676 .565 .790 .321 .647 .543
C OMPETE .947 .306 .602 .583 .875 .034 .441 .589 .939 .172 .490 .545 .611 .040 .670 .795

Table 1: Performance of abstain strategies on four datasets and three LLMs. Best results in bold and second best in underline. Approaches are color-coded per category: calibration, training, prompting, consistency, and collaboration. Certain incompatible cases, e.g. EMBEDDING with the black-box CHATGPT, are omitted. "-" indicates that this approach fails to produce meaningful abstain decisions: almost always abstain, didn't follow instructions, etc. COOPERATE and COMPETE achieve the best performance in 9 of the 12 settings in terms of reliable accuracy.

Tasks and Datasets We evaluate LLM abstention with four datasets spanning diverse knowledge domains and reasoning scenarios: 1) MMLU (Hendrycks et al., 2020) is a multiple-choice dataset for general knowledge QA; 2) Knowledge Crosswords (Ding et al., 2023) is a structured QA dataset that focuses on multi-hop and compositional knowledge reasoning; 3) Hellaswag (Zellers et al., 2019) is a natural language inference dataset that tests commonsense knowledge and reasoning; 4) Propaganda (Piskorski et al., 2023) tasks LLMs with identifying the 23 propaganda tactics in a long news article based on their internal knowledge. The datasets are all in English. We create splits of held-out sets and test sets for the four datasets with details in Appendix B.1. We employ LLMs to answer these questions with zero-shot prompting and greedy decoding to obtain the ground truth of whether the LLM "knows" the answer and whether it should abstain. We posit that a robust abstain mechanism should work for knowledge tasks across domains and reasoning contexts.

[Figure 2: Four outcomes of AbstainQA — answered & correct: A; answered & incorrect: C; abstained & correct: B; abstained & incorrect: D.]

Evaluation Metrics We illustrate the four scenarios under AbstainQA in Figure 2 and employ the following evaluation metrics. (1) Reliable Accuracy (R-Acc): A/(A+C) indicates to what extent LLM-generated answers (not abstained) could be trusted, i.e., out of all questions answered, how many are correct? (2) Effective Reliability (ER) (Whitehead et al., 2022; Si et al., 2023): (A−C)/(A+B+C+D) strikes a balance between reliability and coverage, i.e., out of all questions, what proportion more are answered correctly than incorrectly? (3) Abstain Accuracy (A-Acc): (A+D)/(A+B+C+D) evaluates whether the abstain decisions are correct: LLMs should abstain when it would provide an incorrect answer and vice versa. (4) Abstain F1 (A-F1): harmonic-mean(precision, recall), where precision = D/(B+D) and recall = D/(C+D), a balanced metric between reliability and answer coverage.
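To make the scoring concrete, below is a small self-contained helper, a sketch for clarity rather than the paper's evaluation script, that computes the four metrics directly from the A/B/C/D outcome counts of Figure 2.

```python
def abstain_metrics(A: int, B: int, C: int, D: int) -> dict:
    """A: answered & correct, B: abstained & correct,
    C: answered & incorrect, D: abstained & incorrect."""
    total = A + B + C + D
    precision = D / (B + D) if (B + D) else 0.0   # abstain precision
    recall = D / (C + D) if (C + D) else 0.0      # abstain recall
    return {
        "R-Acc": A / (A + C) if (A + C) else 0.0,   # reliable accuracy
        "ER": (A - C) / total,                       # effective reliability
        "A-Acc": (A + D) / total,                    # abstain accuracy
        "A-F1": (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0),
    }

# Example usage with invented counts:
print(abstain_metrics(A=50, B=10, C=20, D=20))
```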
4 Results

We present the performance of abstain mechanisms with three LLMs on four tasks in Table 1.

COOPERATE and COMPETE are state-of-the-art abstain mechanisms. Our proposed collaboration-based approaches outperform the strongest baseline in 9 out of the 12 settings (3 models and 4 datasets), achieving an average improvement of 10.1% on reliable accuracy scores. We find that COOPERATE works better with stronger models such as ChatGPT: we hypothesize that the complexities of the multi-step feedback generation warrant a stronger base LLM. COMPETE emphasizes reliability and greatly avoids wrong answers, evident in the high reliable accuracy scores: we observe that LLMs stick to the original answers only in the most confident scenarios, making COMPETE ideal for abstention when reliability is paramount.

Calibration > Training > Consistency > Prompting On average, the four types of approaches achieved 0.595, 0.576, 0.553, and 0.475 A-Acc scores. While simple prompting-based approaches struggle to elicit self-reflection, our proposed COOPERATE and COMPETE promote collaborative abstention and greatly improve performance. Among all baselines, instruction tuning and self-consistency threshold stand out as two strong approaches: however, they both rely on a held-out set for training and hyperparameter tuning; we further investigate their generalizability in Section 5.

Abstain capabilities potentially correlate with base LLM utility. Abstain accuracy (A-Acc) is a metric independent of LLMs' underlying performance on a QA dataset and solely evaluates abstention quality. For A-Acc, Mistral-7B, LLaMA2-70B, and ChatGPT achieve 0.524, 0.537, and 0.616 A-Acc scores on average across approaches and datasets, which aligns well with their relative superiority on standard benchmarks. In addition, our proposed collaboration-based approaches also work better with the strongest ChatGPT, while also improving the abstention performance of weaker LLMs over baselines.

Going beyond generic QA sheds light on the limitations of existing approaches. While most baselines could function properly on MMLU, some collapse ("-" in Table 1) in other knowledge domains or reasoning contexts: ASK CALIBRATE, among others, struggles to follow instructions when the prompt context is long in detecting propaganda tactics; INSTRUCTION TUNING collapses and produces 100% abstention due to imbalances in the held-out set in K-Crosswords where problems are much harder and the LLM should mostly abstain; calibration approaches struggle to set a good threshold when the model's base performance is too low. However, our proposed collaboration-based approaches work well with all four tasks, especially with the strongest ChatGPT language model, indicating their robustness and broad compatibility for knowledge-intensive tasks across domains and reasoning contexts.

5 Analysis

Abstain Absolute Aside from QA datasets where abstain decisions should be made based on the potential correctness of answers, we investigate scenarios where LLMs should 100% abstain. Specifically, we employ AmbigQA (Min et al., 2020), where LLMs should always abstain from ambiguous and underspecified questions. We also curate ElectionQA23, a QA dataset focusing on elections in 2023 around the globe (details in Appendix B.1), where LLMs should abstain due to the temporal mismatch of the training data cutoff, i.e. if the LLM has only seen training data before 2023. Other potential abstain absolute scenarios include known unknowns (Amayuelas et al., 2023) and questions with false/underspecified premises (Patidar et al., 2023; Hu et al., 2023). Figure 3 demonstrates that our proposed COMPETE consistently abstains the most in abstain absolute scenarios across three LLMs and two datasets: we observe that the proposed alternative answers in COMPETE often provide contradictory contexts, thus the LLMs
successfully abstain due to the knowledge conflicts. We further divide the ElectionQA23 dataset into subsets by the continent in which the election is held and present model performance in Figure 5. It is illustrated that LLMs tend to abstain less and assume the outcomes of future elections for Africa and Asia, indicating fairness concerns of abstain mechanisms that might underserve marginalized communities and underdeveloped countries.

[Figure 3: abstain rate (%) on AmbigQA and ElectionQA23 for Mistral-7B, LLaMA2-70B, and ChatGPT across Reflect, MoreInfo, SC Thres., Ask Cali., Coop-self, Coop-others, and Compete.]
Figure 3: Performance of abstain mechanisms in the abstain absolute scenarios where the LLM should abstain for 100% of questions. COMPETE achieves the highest abstention rate on average across LLMs and datasets.

Abstain and Retrieval Retrieval-augmented language models (Khattab et al., 2022; Shi et al., 2023; Baek et al., 2023; Yu et al., 2023b; Thakur et al., 2023) have become the go-to approach for augmenting LLMs' knowledge access, while retrieving from an external corpus could also introduce noisy documents and lead to wrong answers (Wang et al., 2023c; Asai et al., 2023). We investigate whether abstain mechanisms could work with retrieval, specifically when questions in the MMLU dataset are presented together with a retrieved document from the WikiSearch API. Table 2 demonstrates that while baselines such as ASK CALIBRATE are negatively impacted by retrieval, our proposed COOPERATE and COMPETE are in general robust towards additional prepended context through retrieval.

Table 2: Performance of abstain strategies on MMLU with Mistral-7B with or without retrieval (Ret.).
Method      Ret.  R-Acc  ER     A-Acc  A-F1
SC THRES.   ✗     0.604  0.140  0.641  0.551
SC THRES.   ✓     0.634  0.160  0.652  0.611
ASK CALI.   ✗     0.648  0.141  0.639  0.647
ASK CALI.   ✓     0.515  0.027  0.512  0.156
COOP-SELF   ✗     0.571  0.059  0.564  0.601
COOP-SELF   ✓     0.567  0.048  0.536  0.589
COMPETE     ✗     0.735  0.140  0.640  0.700
COMPETE     ✓     0.720  0.128  0.618  0.670

Table 3: Abstain estimated calibration error (ECE) of approaches with Mistral-7B, the lower the better. The calibration of abstention scores for approaches could vary significantly across tasks and datasets.
Method        MMLU    K-Cross.  Hellaswag  Propaganda
MORE INFO     0.4433  0.7281    0.5830     0.6143
PROBS         0.4253  0.6643    0.4992     0.5302
SC THRES.     0.2854  0.3428    0.2146     0.2750
ASK CALI.     0.1747  0.2664    0.2088     0.3755
VERIFIER      0.3343  0.2000    0.3624     0.2112
COOP-SELF     0.4740  0.1131    0.3434     0.2263
COOP-OTHERS   0.3560  0.5330    0.2164     0.2380
COMPETE       0.4484  0.3929    0.4014     0.1935

Building upon the findings that collaboration-based approaches are robust with retrieved contexts, we propose a 2-step abstain-based framework to selectively activate retrieval and identify retrieval failures. Specifically, abstain decisions are first made based on QA without retrieval. If the LLM abstains, retrieval is then requested and the abstain mechanism is applied to QA with the retrieved context: if the LLM still decides to abstain, it indicates that the retrieval has failed to find relevant/helpful documents. We evaluate this framework on the MMLU and Knowledge Crosswords datasets while comparing it with no retrieval and full retrieval settings. Results in Figure 4 demonstrate that this abstain-retrieve-abstain pipeline successfully cuts back the error rate by up to 56.6%, highlighting the potential of abstain mechanisms in identifying retrieval failures and enhancing the reliability of retrieval-augmented LLMs.

[Figure 4: incorrect rates for COMPETE with Mistral-7B on MMLU (50.1%, 41.9%, 18.2%) and K-Crosswords (75.0%, 73.6%, 57.5%) under w/o retrieval, w/ retrieval, and abstain + retrieval settings.]
Figure 4: Performance of COMPETE with MISTRAL-7B in the two-step abstain and retrieval setting. The proposed abstain-retrieve-abstain pipeline successfully reduces the incorrect rate by at least 21.2%.
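A sketch of the two-step abstain-retrieve-abstain loop described above; `answer`, `abstain`, and `retrieve` are hypothetical callables standing in for the base QA call, any of the abstain mechanisms from §2, and the retrieval step (e.g., a WikiSearch lookup), respectively.

```python
from typing import Callable, Optional, Tuple

def abstain_retrieve_abstain(
    question: str,
    answer: Callable[[str], str],          # LLM answers a (possibly augmented) prompt
    abstain: Callable[[str, str], bool],   # abstain mechanism over (prompt, answer)
    retrieve: Callable[[str], str],        # returns a retrieved document for the question
) -> Tuple[Optional[str], str]:
    """Return (answer or None, status) following the two-step pipeline."""
    first = answer(question)
    if not abstain(question, first):
        return first, "answered without retrieval"

    # Step 2: only query retrieval once the model has abstained.
    augmented = f"Knowledge: {retrieve(question)}\n{question}"
    second = answer(augmented)
    if not abstain(augmented, second):
        return second, "answered with retrieval"

    # Still abstaining: treat this as a retrieval failure / unanswerable case.
    return None, "abstained; retrieval did not help"
```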
[Figure 5: abstain rate (%) per continent (Africa, Asia, America, Europe) with Mistral-7B for Reflect, MoreInfo, SC Thres., Ask Cali., Coop-self, Coop-others, and Compete.]
Figure 5: Abstain rate on ElectionQA23 with Mistral-7B divided by where the election takes place. The lowest abstain rate for each approach among continents is highlighted in bold.

Abstain in Multi-Hop In multi-hop knowledge reasoning, LLMs' knowledge is often limited in some, but not all, of the reasoning steps. We investigate whether our proposed abstain strategies could pinpoint the specific steps where LLMs should abstain due to knowledge limitations: we employ the three-hop subset of the K-Crosswords dataset, conduct abstain mechanisms on each of the hops separately, and present results in Figure 6. It is demonstrated that our proposed COOPERATE and COMPETE are better in localizing knowledge limitations, with a final error rate of 67.2% to 81.2% lower than the strongest baselines.

[Figure 6: two-step outcomes for SC Threshold, Ask Calibrate, Cooperate-self, and Compete, with incorrect rates of 25.6%, 38.9%, 4.6%, and 2.8% respectively.]
Figure 6: Performance of abstain mechanisms with ChatGPT solving 3-hop questions in K-Crosswords. COOPERATE and COMPETE are better at localizing knowledge gaps in multi-hop reasoning steps and reducing the incorrect rate by at least 82%.
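For concreteness, per-hop abstention as used in the analysis above can be sketched as running the same abstain mechanism on each single-hop sub-question; `abstain` is any of the §2 mechanisms and the example hop questions below are invented for illustration.

```python
from typing import Callable, Dict, List

def abstain_per_hop(hops: List[str], abstain: Callable[[str], bool]) -> Dict[str, bool]:
    """Run the abstain mechanism separately on each single-hop sub-question."""
    return {hop: abstain(hop) for hop in hops}

# Example: a toy abstain rule that flags only the hop the LLM supposedly lacks.
hops = [
    "Which university did person X attend?",
    "In which city is that university located?",
    "Which river runs through that city?",
]
print(abstain_per_hop(hops, abstain=lambda q: "river" in q))
```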
Abstain ECE Aside from a binary decision, we posit that abstain mechanisms could also provide a continuous abstain likelihood based on token probabilities, heuristics, and more.[3] We calculate the abstain ECE, estimated calibration error (Guo et al., 2017) but with abstain probabilities and abstain accuracy, on the MMLU dataset and present it in Table 3. It is demonstrated that the calibration of abstain likelihood generally correlates with the abstain performance, with ASK CALIBRATE and our proposed COOPERATE and COMPETE being the top-performing approaches. This indicates that our proposed collaboration-based approaches could also provide well-calibrated abstain likelihood scores where fine-grained decision-making is required.

[3] Details in Appendix B.

Precision and Recall We observe that among the 14 abstain mechanisms, some emphasize answer coverage while others emphasize answer reliability. We illustrate the abstain precision and recall of approaches in Figure 7: it is illustrated that while most approaches have higher precision than recall and thus lean towards answer coverage, COMPETE has a much higher abstain recall and stresses answer reliability. We argue that these strategies could be ensembled to achieve balanced abstain decisions or employed based on whether the application scenario should optimize coverage or reliability.

[Figure 7: abstain precision vs. abstain recall per approach (Gen+Match, Embed., NOTA, MoreInfo, Temp., Probs, SC Thres., Reflect, Verifier, Coop-others, Coop-self, Ask Cali., Instruct, Compete).]
Figure 7: Abstain precision and recall of approaches on MMLU with Mistral-7B. While most approaches feature higher precision, COMPETE achieves the best abstention recall and could be employed where reliability is paramount.

6 Related Work

Previous works on knowing what language models know have focused on calibration, prompting, or training. Calibration-based approaches attempt to extract confidence scores from models to gauge their uncertainty (Sun et al., 2022; Kuhn et al., 2022; Zhou et al., 2023a; Liu et al., 2023a). Previous works have evaluated (Radford et al., 2019; Liang et al., 2023; Tao et al., 2023; He et al., 2023) and improved the calibration of language models (Desai and Durrett, 2020; Kong et al., 2020; Jagannatha and Yu, 2020; Kamath et al., 2020; Jiang et al., 2021; Mielke et al., 2022; Lin et al., 2022; Tian et al., 2023; Zhou et al., 2023b), while the calibration quality could vary for different downstream tasks (Desai and Durrett, 2020; Wang et al., 2020; Stengel-Eskin and Van Durme, 2023; Kalai and Vempala, 2023). In the setting of AbstainQA where

LLMs decide whether to abstain or not given a question, calibration-based approaches would rely on a held-out set to set a threshold over calibrated confidence scores: such a threshold and reliance on a held-out set could jeopardize the generalization of calibration-based approaches across knowledge domains and reasoning contexts.

Prompting-based approaches leverage the instruction-following abilities of LLMs and employ instructions to induce self-reflection and gauge whether the generated answer should be trusted. Kadavath et al. (2022) investigates whether a "none-of-the-above" option or self-evaluation prompting would induce good estimations of LLMs' internal factuality. Huang et al. (2023a) follows existing works (Kim et al., 2024; Shinn et al., 2023) to evaluate whether LLMs could self-correct their own reasoning with a three-step prompting strategy. Feng et al. (2023b) prompts the LLM to reflect on whether more information is needed for a given question to selectively solicit external knowledge. These approaches, among other prompting-based methods for acknowledging knowledge limitations (Wang et al., 2023a; Si et al., 2023), mostly rely on the hope that LLMs could indeed carry out self-reflection through simple prompting, while their ability on this front might be subject to confounders and spurious correlations (Kadavath et al., 2022).

Training-based approaches focus on empowering LLMs with the ability to acknowledge uncertainty or abstain through training with special data or objectives. Cobbe et al. (2021) proposes to train an external verifier to evaluate the correctness of LLM-generated answers in math reasoning. A series of works also attempt to probe the hidden representations of LLMs for signals of factuality and hallucination (Slobodkin et al., 2023; Azaria and Mitchell, 2023). Contemporary works have gained interest in improving LLMs' factuality and self-knowledge through instruction tuning and alignment (Zhang et al., 2023; Yang et al., 2023; Sun et al., 2023; Bashlovkina et al., 2023), as well as promoting abstention from a safety perspective (Huang et al., 2023b; Liu et al., 2023b). While training-based approaches have shown great potential, they would often require substantial computational resources to fine-tune LLMs and might struggle to generalize across knowledge domains and contexts due to their reliance on the patterns and examples in the training subset.

In this work, we propose COOPERATE and COMPETE, two collaboration-based approaches to enable robust LLM abstention through multi-LLM collaboration. These approaches have shown great potential in improving LLM abstention in not only direct QA, but also retrieval-augmented QA and multi-hop reasoning scenarios.

7 Conclusion

We investigate AbstainQA, a setting where LLMs should abstain from answering questions incorrectly. We curate a taxonomy of 11 abstain baselines across four categories and propose COOPERATE and COMPETE, two novel abstain mechanisms that promote mechanistic reflection through multi-LLM collaboration, in cooperation or competition. Extensive experiments on four datasets demonstrate that COOPERATE and COMPETE advance the state-of-the-art in AbstainQA, with the potential to improve retrieval-augmented LLMs, multi-hop knowledge reasoning, and more.

Limitations

While we primarily investigated and evaluated LLM abstention from a knowledge perspective, the problem of abstaining is also important from a safety perspective, preventing LLMs from generating harmful content or biased responses (Huang et al., 2023b; Liu et al., 2023b). There are also intersections between the knowledge and safety aspects of abstention, such as abstaining from underspecified questions based on stereotypes of demographic attributes. We plan to expand on this work and propose unified abstain strategies for knowledge and safety scenarios in the future.

The abstain mechanisms and experiments in this work are exhaustive to the extent that our computational budget permits, while we could not cover every single approach and proposal from existing works. Nevertheless, we will make the evaluation data, code, and framework publicly available so that new abstain approaches, as well as datasets, can be seamlessly added to advance LLM abstention research.

Our proposed collaboration-based approaches require prompting multiple LLMs for feedback and could have more computational overhead, while the exact cost would depend on the inference costs of other LLMs. We provide a summary of inference costs in Table 6 and Appendix A. We posit that the value of multi-LLM collaboration for abstention lies in using smaller models to supervise larger models in judging veracity and preventing hallucinations. When we serve a user-facing LLM with hundreds of billions of parameters, our approach enables the incorporation of a few 7B LLMs for abstention guidance to mitigate hallucinations: having a few extra 7B models does not add greatly to the cost of a hundreds-of-billions LLM, while significantly boosting its factuality and reliability. By sampling multiple feedbacks to make abstain decisions, randomness is also introduced and there might be variation across model inference runs.

In this work, we focus on a conceptual "abstain" functionality and develop robust approaches to identify the knowledge gaps in large language models. However, "abstain" doesn't necessarily have to be part of the user interface of LLMs: it could be an internal component of an LLM-based system, followed by approaches to find better knowledge sources to fill the knowledge gaps or methods to incorporate the uncertainty into model responses to enhance trustworthiness (Zhou et al., 2024). Future work could further explore LLM pipelines where these abstain approaches are one component towards reliable question answering.

We acknowledge the potential impact of test set pollution in the training of the adopted LLMs. While we do not know the exact training composition of ChatGPT, and the other two LLMs (Mistral-7B, LLaMA2-70B) are not explicitly trained on test sets, they might still be unintentionally included in the training data. Our results and findings should be thus interpreted with this limitation in mind.

Ethics Statement

Figure 5 highlights that LLM abstain decisions could also have fairness implications, as they fail to abstain and assume the outcomes of future elections more frequently for African and Asian countries. We similarly expect varying levels of abstain performance on questions and prompts regarding different demographics, communities, and perspectives, potentially due to LLMs' internal social biases (Blodgett et al., 2020; Jin et al., 2021; Bender et al., 2021; Shaikh et al., 2023; Feng et al., 2023a). We plan to examine the intersections of LLM social biases and their abstention abilities in future work, especially in critical domains such as hate speech and misinformation where failing to abstain from generating problematic content could have grave consequences for marginalized communities.

Acknowledgements

This material is based upon work supported by the National Science Foundation under CAREER Grant No. IIS2142739, NSF Grant No. IIS2203097, and the DARPA CMO under Contract No. HR001120C0124. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies. We also gratefully acknowledge support from an Alfred P. Sloan Foundation Fellowship.

References

Alfonso Amayuelas, Liangming Pan, Wenhu Chen, and William Wang. 2023. Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models. arXiv preprint arXiv:2305.13712.

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.

Amos Azaria and Tom Mitchell. 2023. The internal state of an llm knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976.

Jinheon Baek, Soyeong Jeong, Minki Kang, Jong C Park, and Sung Hwang. 2023. Knowledge-augmented language model verification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1720–1736.

Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, and Partha Talukdar. 2023. Llm augmented
llms: Expanding capabilities through composition. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian,
In The Twelfth International Conference on Learning Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias
Representations. Plappert, Jerry Tworek, Jacob Hilton, Reiichiro
Nakano, Christopher Hesse, and John Schulman.
Vasilisa Bashlovkina, Zhaobin Kuang, Riley Matthews, 2021. Training verifiers to solve math word prob-
Edward Clifford, Yennie Jun, William W Cohen, and lems. arXiv preprint arXiv:2110.14168.
Simon Baumgartner. 2023. Trusted source align-
ment in large language models. arXiv preprint Roi Cohen, Mor Geva, Jonathan Berant, and Amir
arXiv:2311.06697. Globerson. 2023. Crawling the internal knowledge-
base of language models. In Findings of the Asso-
Emily M Bender, Timnit Gebru, Angelina McMillan- ciation for Computational Linguistics: EACL 2023,
Major, and Shmargaret Shmitchell. 2021. On the pages 1811–1824.
dangers of stochastic parrots: Can language models
be too big? In Proceedings of the 2021 ACM confer- Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Edit-
ence on fairness, accountability, and transparency, ing factual knowledge in language models. In Pro-
pages 610–623. ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 6491–
Su Lin Blodgett, Solon Barocas, Hal Daumé III, and 6506.
Hanna Wallach. 2020. Language (technology) is
power: A critical survey of “bias” in NLP. In Pro- Shrey Desai and Greg Durrett. 2020. Calibration of
ceedings of the 58th Annual Meeting of the Asso- pre-trained transformers. In Proceedings of the 2020
ciation for Computational Linguistics, pages 5454– Conference on Empirical Methods in Natural Lan-
5476, Online. Association for Computational Lin- guage Processing (EMNLP), pages 295–302, Online.
guistics. Association for Computational Linguistics.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoff- Wenxuan Ding, Shangbin Feng, Yuhan Liu, Zhaoxuan
mann, Trevor Cai, Eliza Rutherford, Katie Milli- Tan, Vidhisha Balachandran, Tianxing He, and Yulia
can, George Bm Van Den Driessche, Jean-Baptiste Tsvetkov. 2023. Knowledge crosswords: Geomet-
Lespiau, Bogdan Damoc, Aidan Clark, Diego ric reasoning over structured knowledge with large
De Las Casas, Aurelia Guy, Jacob Menick, Roman language models. arXiv preprint arXiv:2310.01290.
Ring, Tom Hennigan, Saffron Huang, Loren Mag-
giore, Chris Jones, Albin Cassirer, Andy Brock, Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenen-
Michela Paganini, Geoffrey Irving, Oriol Vinyals, baum, and Igor Mordatch. 2023. Improving factual-
Simon Osindero, Karen Simonyan, Jack Rae, Erich ity and reasoning in language models through multia-
Elsen, and Laurent Sifre. 2022. Improving language gent debate. arXiv preprint arXiv:2305.14325.
models by retrieving from trillions of tokens. In
Proceedings of the 39th International Conference Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia
on Machine Learning, volume 162 of Proceedings Tsvetkov. 2023a. From pretraining data to language
of Machine Learning Research, pages 2206–2240. models to downstream tasks: Tracking the trails of
PMLR. political biases leading to unfair NLP models. In
Proceedings of the 61st Annual Meeting of the As-
Tom Brown, Benjamin Mann, Nick Ryder, Melanie sociation for Computational Linguistics (Volume 1:
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Long Papers), pages 11737–11762, Toronto, Canada.
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Association for Computational Linguistics.
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Shangbin Feng, Weijia Shi, Yuyang Bai, Vidhisha Bal-
Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens achandran, Tianxing He, and Yulia Tsvetkov. 2023b.
Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- Knowledge card: Filling llms’ knowledge gaps with
teusz Litwin, Scott Gray, Benjamin Chess, Jack plug-in specialized language models. In The Twelfth
Clark, Christopher Berner, Sam McCandlish, Alec International Conference on Learning Representa-
Radford, Ilya Sutskever, and Dario Amodei. 2020. tions.
Language models are few-shot learners. In Ad-
vances in Neural Information Processing Systems, Xingyu Fu, Sheng Zhang, Gukyeong Kwon, Pramu-
volume 33, pages 1877–1901. Curran Associates, ditha Perera, Henghui Zhu, Yuhao Zhang, Alexan-
Inc. der Hanbo Li, William Yang Wang, Zhiguo Wang,
Vittorio Castelli, Patrick Ng, Dan Roth, and Bing
Sky CH-Wang, Benjamin Van Durme, Jason Eisner, Xiang. 2023. Generate then select: Open-ended vi-
and Chris Kedzie. 2023. Do androids know they’re sual question answering guided by world knowledge.
only dreaming of electric sheep? arXiv preprint In Findings of the Association for Computational
arXiv:2312.17249. Linguistics: ACL 2023, pages 2333–2346, Toronto,
Canada. Association for Computational Linguistics.
Tong Chen, Hongwei Wang, Sihao Chen, Wenhao
Yu, Kaixin Ma, Xinran Zhao, Dong Yu, and Hong- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Wein-
ming Zhang. 2023. Dense x retrieval: What re- berger. 2017. On calibration of modern neural net-
trieval granularity should we use? arXiv preprint works. In International conference on machine learn-
arXiv:2312.06648. ing, pages 1321–1330. PMLR.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu- Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham
pat, and Mingwei Chang. 2020. Retrieval augmented Neubig. 2021. How can we know when language
language model pre-training. In International confer- models know? on the calibration of language models
ence on machine learning, pages 3929–3938. PMLR. for question answering. Transactions of the Associa-
tion for Computational Linguistics, 9:962–977.
Guande He, Peng Cui, Jianfei Chen, Wenbo Hu, and Jun
Zhu. 2023. Investigating uncertainty calibration of Xisen Jin, Francesco Barbieri, Brendan Kennedy, Aida
aligned language models under the multiple-choice Mostafazadeh Davani, Leonardo Neves, and Xiang
setting. arXiv preprint arXiv:2310.11732. Ren. 2021. On transferability of bias mitigation ef-
fects in language model fine-tuning. In Proceedings
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, of the 2021 Conference of the North American Chap-
Mantas Mazeika, Dawn Song, and Jacob Steinhardt. ter of the Association for Computational Linguistics:
2020. Measuring massive multitask language under- Human Language Technologies, pages 3770–3783,
standing. In International Conference on Learning Online. Association for Computational Linguistics.
Representations.
Evan Hernandez, Belinda Z Li, and Jacob Andreas. Saurav Kadavath, Tom Conerly, Amanda Askell, T. J.
2023. Measuring and manipulating knowledge rep- Henighan, Dawn Drain, Ethan Perez, Nicholas
resentations in language models. arXiv preprint Schiefer, Zachary Dodds, Nova DasSarma, Eli Tran-
arXiv:2304.00740. Johnson, Scott Johnston, Sheer El-Showk, Andy
Jones, Nelson Elhage, Tristan Hume, Anna Chen,
Shengding Hu, Yifan Luo, Huadong Wang, Xingyi Yuntao Bai, Sam Bowman, Stanislav Fort, Deep
Cheng, Zhiyuan Liu, and Maosong Sun. 2023. Won’t Ganguli, Danny Hernandez, Josh Jacobson, John
get fooled again: Answering questions with false Kernion, Shauna Kravec, Liane Lovitt, Kamal
premises. In Proceedings of the 61st Annual Meet- Ndousse, Catherine Olsson, Sam Ringer, Dario
ing of the Association for Computational Linguistics Amodei, Tom B. Brown, Jack Clark, Nicholas Joseph,
(Volume 1: Long Papers), pages 5626–5643, Toronto, Benjamin Mann, Sam McCandlish, Christopher Olah,
Canada. Association for Computational Linguistics. and Jared Kaplan. 2022. Language models (mostly)
know what they know. ArXiv, abs/2207.05221.
Jie Huang, Xinyun Chen, Swaroop Mishra,
Huaixiu Steven Zheng, Adams Wei Yu, Xiny- Adam Tauman Kalai and Santosh S Vempala. 2023.
ing Song, and Denny Zhou. 2023a. Large language Calibrated language models must hallucinate. arXiv
models cannot self-correct reasoning yet. In The preprint arXiv:2311.14648.
Twelfth International Conference on Learning
Representations. Amita Kamath, Robin Jia, and Percy Liang. 2020. Se-
lective question answering under domain shift. In
Xiaowei Huang, Wenjie Ruan, Wei Huang, Gao Jin, Proceedings of the 58th Annual Meeting of the Asso-
Yizhen Dong, Changshun Wu, Saddek Bensalem, ciation for Computational Linguistics, pages 5684–
Ronghui Mu, Yi Qi, Xingyu Zhao, Kaiwen Cai, Yang- 5696, Online. Association for Computational Lin-
hao Zhang, Sihao Wu, Peipei Xu, Dengyu Wu, André guistics.
Freitas, and Mustafa A. Mustafa. 2023b. A survey of
safety and trustworthiness of large language models Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric
through the lens of verification and validation. ArXiv, Wallace, and Colin Raffel. 2023. Large language
abs/2305.11391. models struggle to learn long-tail knowledge. In In-
Abhyuday Jagannatha and Hong Yu. 2020. Calibrat- ternational Conference on Machine Learning, pages
ing structured output predictors for natural language 15696–15707. PMLR.
processing. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics, Junmo Kang, Hongyin Luo, Yada Zhu, James Glass,
pages 2078–2092, Online. Association for Computa- David Cox, Alan Ritter, Rogerio Feris, and Leonid
tional Linguistics. Karlinsky. 2023. Self-specialization: Uncovering
latent expertise within large language models. arXiv
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan preprint arXiv:2310.00160.
Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea
Madotto, and Pascale Fung. 2023. Survey of halluci- Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Akari
nation in natural language generation. ACM Comput- Asai, Xinyan Yu, Dragomir Radev, Noah A Smith,
ing Surveys, 55(12):1–38. Yejin Choi, Kentaro Inui, et al. 2024. Realtime qa:
What’s the answer right now? Advances in Neural
Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Information Processing Systems, 36.
Mensch, Chris Bamford, Devendra Singh Chap-
lot, Diego de Las Casas, Florian Bressand, Gi- Omar Khattab, Keshav Santhanam, Xiang Lisa
anna Lengyel, Guillaume Lample, Lucile Saulnier, Li, David Hall, Percy Liang, Christopher Potts,
L’elio Renard Lavaud, Marie-Anne Lachaux, Pierre and Matei Zaharia. 2022. Demonstrate-search-
Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, predict: Composing retrieval and language mod-
Timothée Lacroix, and William El Sayed. 2023. Mis- els for knowledge-intensive nlp. arXiv preprint
tral 7b. ArXiv, abs/2310.06825. arXiv:2212.14024.
Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2024. Language models can solve computer tasks. Advances in Neural Information Processing Systems, 36.

Lingkai Kong, Haoming Jiang, Yuchen Zhuang, Jie Lyu, Tuo Zhao, and Chao Zhang. 2020. Calibrated language model fine-tuning for in- and out-of-distribution data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1326–1340, Online. Association for Computational Linguistics.

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2022. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2023. Holistic evaluation of language models. Transactions on Machine Learning Research.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.

Xin Liu, Muhammad Khalifa, and Lu Wang. 2023a. Litcab: Lightweight calibration of language models on outputs of varied lengths. arXiv preprint arXiv:2310.19208.

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023b. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. In Socially Responsible Language Modelling Research.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Ziyang Luo, Can Xu, Pu Zhao, Xiubo Geng, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Augmented large language models with parametric knowledge guiding. arXiv preprint arXiv:2305.04757.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822.

Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. 2022. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872.

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Ouyang Long, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. Webgpt: Browser-assisted question-answering with human feedback. ArXiv, abs/2112.09332.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Mayur Patidar, Prayushi Faldu, Avinash Singh, Lovekesh Vig, Indrajit Bhattacharya, and Mausam. 2023. Do I have the knowledge to answer? investigating answerability of knowledge base questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10341–10357, Toronto, Canada. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.

Jakub Piskorski, Nicolas Stefanovitch, Giovanni Da San Martino, and Preslav Nakov. 2023. Semeval-2023 task 3: Detecting the category, the framing, and the persuasion techniques in online news in a multilingual setup. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pages 2343–2361.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, Singapore. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2023. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.
Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. 2023. On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4454–4470, Toronto, Canada. Association for Computational Linguistics.

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.

Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer, and Jordan Boyd-Graber. 2023. Getting more out of mixture of language model reasoning experts. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8234–8249.

Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, and Shauli Ravfogel. 2023. The curious case of hallucinatory (un)answerability: Finding truths in the hidden states of over-confident large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3607–3625.

Elias Stengel-Eskin and Benjamin Van Durme. 2023. Calibrated interpretation: Confidence estimation in semantic parsing. Transactions of the Association for Computational Linguistics, 11:1213–1231.

Meiqi Sun, Wilson Yan, Pieter Abbeel, and Igor Mordatch. 2022. Quantifying uncertainty in foundation models via ensembles. In NeurIPS 2022 Workshop on Robustness in Sequence Modeling.

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. 2023. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525.

Linwei Tao, Younan Zhu, Haolan Guo, Minjing Dong, and Chang Xu. 2023. A benchmark study on calibration. In The Twelfth International Conference on Learning Representations.

Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, et al. 2023. Nomiracl: Knowing when you don’t know for robust multilingual retrieval-augmented generation. arXiv preprint arXiv:2312.11361.

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5433–5442, Singapore. Association for Computational Linguistics.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Boshi Wang, Xiang Yue, and Huan Sun. 2023a. Can ChatGPT defend its belief in truth? evaluating LLM reasoning via debate. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11865–11881, Singapore. Association for Computational Linguistics.

Shuo Wang, Zhaopeng Tu, Shuming Shi, and Yang Liu. 2020. On the inference calibration of neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3070–3079, Online. Association for Computational Linguistics.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.

Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. 2023b. Resolving knowledge conflicts in large language models. arXiv preprint arXiv:2310.00935.

Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. 2023c. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, et al. 2023. The generative ai paradox: “what it can create, it may not understand”. In The Twelfth International Conference on Learning Representations.

Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, and Marcus Rohrbach. 2022. Reliable visual question answering: Abstain rather than answer incorrectly. In European Conference on Computer Vision, pages 148–166. Springer.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. 2023. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations.

Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. Recomp: Improving retrieval-augmented lms with compression and selective augmentation. arXiv preprint arXiv:2310.04408.

Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. 2023. Alignment for honesty. arXiv preprint arXiv:2312.07000.

Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, et al. 2023a. Kola: Carefully benchmarking world knowledge of large language models. In The Twelfth International Conference on Learning Representations.

Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. 2023b. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.

Hanning Zhang, Shizhe Diao, Yong Lin, Yi R Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2023. R-tuning: Teaching large language models to refuse unknown questions. arXiv preprint arXiv:2311.09677.

Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, and Subhrajit Roy. 2023a. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. arXiv preprint arXiv:2309.17249.

Kaitlyn Zhou, Jena D Hwang, Xiang Ren, and Maarten Sap. 2024. Relying on the unreliable: The impact of language models’ reluctance to express uncertainty. arXiv preprint arXiv:2401.06730.

Kaitlyn Zhou, Dan Jurafsky, and Tatsunori Hashimoto. 2023b. Navigating the grey area: How expressions of uncertainty and overconfidence affect language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5506–5524, Singapore. Association for Computational Linguistics.

Figure 8: Performance of INSTRUCTION TUNING when trained with one dataset/LLM and tested for another. Training on a held-out set harms the generalization of abstain accuracy across tasks and LLMs.

                      fail in 1/3 hops   fail in 2/3 hops   fail in 3/3 hops
abstain in 1/3 hops         351                153                 12
abstain in 2/3 hops          91                512                 63
abstain in 3/3 hops          30                 20                156

Table 4: Alignment between model abstain decisions and incorrect answers regarding the number of hops in multi-hop QA. COMPETE shows a 73.4% match between abstain decisions and model failures.

A Analysis (cont.)

Retrieval Failure (cont.) We present more results with the two-step abstention in retrieval-augmented LLMs in Figure 10. Across three LLMs and four datasets, the abstain-retrieve-abstain pipeline successfully brings down the incorrect rate and accounts for retrieval failure.

Abstain and Multi-Hop (cont.) We present more results with the multi-hop abstention in Figure 11. It is demonstrated that our proposed COOPERATE and COMPETE are consistently better at pinpointing knowledge limitations in specific reasoning steps across LLMs and datasets. We further investigate the alignment between model abstain decisions and model answer incorrectness: how many hops did the LLM abstain for, and how many hops did the LLM actually answer incorrectly? Table 4 indicates that model abstain decisions match model failures 73.4% of the time, showcasing the effectiveness of multi-LLM collaboration for abstention in multi-hop problems.

Held-Out Sets Harm Generalization In the study we find that most of the baselines would require a held-out set of questions for model tuning, hyperparameter optimization, and more. Specifically, while INSTRUCTION TUNING is one of the strongest baselines in Section 4, its reliance on a held-out set for training might jeopardize its generalization across knowledge domains and LLMs.
To this end, we conduct a generalization study in Figure 8 where there is a transfer across datasets or LLMs between training and testing. It is illustrated that INSTRUCTION TUNING approaches struggle to generalize across knowledge domains and LLMs, resulting in performance drops sometimes as large as 33.8% in abstain accuracy. On the contrary, our proposed collaboration-based approaches do not need a held-out set and will not suffer from these generalization limitations.

Figure 9: Performance with three abstain mechanisms with increasing size of model parameters.

Minority Opinion By default, we have k = 3 feedback passages for each final judge LLM to consider. We find that on the MMLU dataset, the final judge follows the majority opinion in the generated feedback 84.2% of the time. We manually examine 10 examples where the judge employed the minority opinion and find that in 8 of the 10 cases, the minority opinion indeed presents new information and should thus be taken into account, showcasing the benefit of having multiple feedback passages through multi-LLM collaboration.

Method        R-Acc   ER     A-Acc   A-F1
COOP-SELF-1   .540    .040   .544    .543
COOP-SELF-2   .564    .035   .498    .553
COOP-SELF-3   .562    .032   .504    .562
COMPETE-1     .672    .088   .590    .671
COMPETE-2     .698    .126   .623    .680
COMPETE-3     .655    .077   .577    .662

Table 5: Performance of COOPERATE and COMPETE with variations in the prompt format, on the MMLU dataset with the Mistral-7B language model.

Scaling Laws We investigate whether LLM abstention with these mechanisms could be improved through scaling, by evaluating three abstain mechanisms with varying sizes of the LLaMA2 LLM. Results in Figure 9 demonstrate that abstain performance does not steadily increase with model size, showcasing that empowering LLMs with abstention abilities is not solely solvable through scaling up.

Shift in Knowledge Domains We investigate whether LLM abstention abilities could be impacted by which knowledge domain the given question is about. Specifically, we present the abstain accuracy metric for different knowledge domains in the MMLU dataset in Figure 12. It is demonstrated that the abstention decisions are indeed of varying quality given the knowledge domain, from 25% to 87.5%, potentially informed by the LLM's underlying knowledge abilities regarding that domain.

Prompt Robustness Since LLMs could be sensitive to minor changes in prompts (Sclar et al., 2023), we experiment with minor paraphrases of the prompts used in COOPERATE and COMPETE and present the results in Table 5. It is demonstrated that the proposed abstain mechanisms are relatively robust to minor changes in prompt phrasing.

LLM Overhead Different abstain mechanisms have different computational overhead, and the most expensive part is the number of LLM inference requests for each question. We present a summary in Table 6.

Working Examples We present qualitative examples of COOPERATE (Tables 14 and 15 for self, Tables 16 and 17 for others) and COMPETE (Tables 18, 19, 20, and 21).
Method        #      Method       #
NOTA          2      REFLECT      2
MOREINFO      2      PROBS        3
SC THRES.     k+1    ASKCALI.     6
TEMP.         3      VERIFIER     3
INSTRUCT      2+ft   EMBEDDING    4
GEN+MATCH     3      COOP-SELF    8
COOP-OTHERS   2+o    COMPETE      2+o

Table 6: Overhead of LLM inference requests for different abstain mechanisms. “ft” denotes additional fine-tuning and “o” denotes that other reviewer LLMs are also called once.

B Experiment Details

B.1 Dataset Details

We employ six datasets in the experiments, four for the main experiments and two additional abstain datasets.

• MMLU (Hendrycks et al., 2020): we randomly downsample the official validation and test splits into 1,000 questions each as the held-out set and test set.

• K-Crosswords (Ding et al., 2023): we follow the official splits to employ the validation set of 1,094 questions as the held-out set and the test set of 1,007 questions.

• Hellaswag (Zellers et al., 2019): we randomly sample the official validation set into 1,000 questions each as the held-out set and test set.

• Propaganda (Piskorski et al., 2023): we create a random split of 231 questions as the held-out set and 200 questions as the test set based on the official validation split. We randomly sample three other propaganda tactics accompanied by the correct tactic to form four-way multiple-choice questions.

• AmbigQA (Min et al., 2020): we randomly downsample the official validation and test splits into 1,000 questions each as the held-out set and test set.

• ElectionQA23: we first collect the overview paragraphs of the Wikipedia pages of elections held in 2023⁴ and employ ChatGPT to compose 10 questions for each election in a four-way multiple-choice format. We conduct manual evaluation to remove low-quality or ambiguous questions and create a held-out set of 67 and a test set of 200 questions.

⁴ https://en.wikipedia.org/wiki/List_of_elections_in_2023

B.2 Model Details

We employ three large language models to evaluate abstain mechanisms: 1) Mistral-7B, through the mistralai/Mistral-7B-Instruct-v0.1 checkpoint on Huggingface (Wolf et al., 2019); 2) LLaMA2-70B, through the meta-llama/Llama-2-70b-chat-hf checkpoint on Huggingface; 3) ChatGPT, through the gpt-3.5-turbo-instruct checkpoint on the OpenAI API.

B.3 Implementation Details

We present implementation details for every single baseline and approach, along with ECE score extraction and the prompts for prompting-based and proposed approaches.

• Token Probability: The token probability of the answer token(s) is employed as p(a). The abstain likelihood is obtained through 1 − p(a).

• Temperature Scaling: We search for an optimal τ value from 0.1 to 10 on the held-out set H. The abstain likelihood is obtained through 1 − p(a) after temperature scaling (see the sketch after this list).

• Ask for Calibration: We follow Tian et al. (2023) and employ the prompt in Table 7 to elicit verbalized confidence scores. The abstain likelihood is obtained through 1 − p(a) with verbalized confidence scores.

• Hidden Layers: We employ the feature-extraction pipeline with HuggingFace (Wolf et al., 2019) to extract e_q. A linear layer of size (DIM, 2) is then employed for linear probing, where DIM is the dimension of e_q. The abstain likelihood is obtained as the ℓ = reject likelihood after softmax (see the sketch after this list).

• External Verifier: We employ RoBERTa (roberta-base) (Liu et al., 2019) as LLM_enc for binary classification. The abstain likelihood is obtained as the ℓ = reject likelihood with the [CLS] token.

• Instruction Tuning: We fine-tune ChatGPT (gpt-3.5-turbo-instruct) with the instruction tuning triplets. The instruction is “Answer the following question. If you don’t have enough knowledge, abstain by saying ’sorry, I don’t have enough knowledge to answer this question.’” The abstain message is “sorry, I don’t have enough knowledge to answer this question.” The abstain likelihood is obtained as the token probability of the answer token, or that of “sorry” if it abstains.

• Self-Reflect: Following Kadavath et al. (2022), we employ the prompt in Table 8. The abstain likelihood is obtained as the token probability of choosing “B”.

• More Information: Following Feng et al. (2023b), we employ the prompt in Table 9. The abstain likelihood is obtained as the token probability of “yes”.

• Generate and Match: We employ the prompt in Table 10. The abstain likelihood is obtained as the probability of “no”, or 1 − that of “yes”.

• None-of-the-Above: The abstain likelihood is obtained as the probability of choosing the NOTA option.

• Self-Consistency Threshold: We employ k = 5 chain-of-thought paths by default. The abstain likelihood is obtained as 1 − plu(LLM, q, k).

• Cooperate: We employ the prompts in Tables 11 and 12 for the self and others modes. The abstain likelihood is obtained as the token probability of choosing “B”.

• Compete: We employ the prompt in Table 13 and employ three alternative answers by default. The abstain likelihood is obtained as the probability of the new answer if it changes, or 1 − that of the original answer if it doesn’t change, then averaged across all alternative answers.
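Below is a minimal illustrative sketch, not the released implementation, of the Token Probability and Temperature Scaling baselines above. The grid of τ values and the selection objective (abstain accuracy at a fixed 0.5 threshold on the held-out set) are assumptions made for the example.

import numpy as np

def answer_prob(option_logits: np.ndarray, answer_idx: int, tau: float = 1.0) -> float:
    # softmax probability p(a) of the chosen answer option under temperature tau
    z = option_logits / tau
    z = z - z.max()  # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p[answer_idx])

def abstain_likelihood(option_logits: np.ndarray, answer_idx: int, tau: float = 1.0) -> float:
    # Token Probability / Temperature Scaling baselines: abstain likelihood = 1 - p(a)
    return 1.0 - answer_prob(option_logits, answer_idx, tau)

def search_tau(heldout, taus=np.linspace(0.1, 10.0, 100), threshold=0.5) -> float:
    # heldout: list of (option_logits, chosen_idx, is_correct) triples from the held-out set H
    # selecting tau by abstain accuracy at a fixed threshold is an assumption for illustration
    def abstain_acc(tau: float) -> float:
        flags = [abstain_likelihood(l, i, tau) >= threshold for l, i, _ in heldout]
        return float(np.mean([f == (not ok) for f, (_, _, ok) in zip(flags, heldout)]))
    return float(max(taus, key=abstain_acc))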

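Similarly, a hedged sketch of the Hidden Layers baseline: a (DIM, 2) linear probe over the extracted question embedding e_q, with the softmax probability of the reject class used as the abstain likelihood. The label convention and optimizer settings below are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class AbstainProbe(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 2)  # logits for [answer, reject]

    def forward(self, e_q: torch.Tensor) -> torch.Tensor:
        return self.linear(e_q)

def train_probe(embeddings: torch.Tensor, labels: torch.Tensor, epochs: int = 50, lr: float = 1e-3) -> AbstainProbe:
    # embeddings: held-out e_q vectors; labels: 1 if the LLM answered that question incorrectly (reject), else 0
    probe = AbstainProbe(embeddings.shape[-1])
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(embeddings), labels)
        loss.backward()
        optimizer.step()
    return probe

def abstain_likelihood(probe: AbstainProbe, e_q: torch.Tensor) -> float:
    # abstain likelihood = softmax probability of the reject class
    with torch.no_grad():
        return torch.softmax(probe(e_q), dim=-1)[..., 1].item()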
Provide your best guess for the following question. Give ONLY the guess, no other words or explanation.

For example:

Guess: <most likely guess, as short as possible; not a complete sentence, just the guess!>

The question is: <question>

[LLM-generated answer]

Provide the probability that your guess is correct. Give ONLY the probability, no other words or explanation.

For example:

Probability: <the probability between 0.0 and 1.0 that your guess is correct, without any extra commentary whatsoever; just
the probability!>
Probability: [LLM-generated probability]

Table 7: Prompt for ASK FOR CALIBRATION.
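A small sketch of how the verbalized confidence elicited by the Table 7 prompt could be parsed into an abstain likelihood; the regular expression and the fallback value are assumptions for illustration, not part of the original pipeline.

import re

def parse_verbalized_confidence(generation: str, default: float = 0.5) -> float:
    # pull the number following "Probability:" from the model output; fall back if it is missing
    match = re.search(r"Probability:\s*([01](?:\.\d+)?)", generation)
    if match is None:
        return default
    return min(max(float(match.group(1)), 0.0), 1.0)

def abstain_likelihood(generation: str) -> float:
    # Ask for Calibration baseline: abstain likelihood = 1 - verbalized p(a)
    return 1.0 - parse_verbalized_confidence(generation)

# e.g., abstain_likelihood("Probability: 0.85") is roughly 0.15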

<question>

[LLM-generated answer]
The above answer is:
A. True
B. False
The answer is [LLM-generated A/B]

Table 8: Prompt for SELF-REFLECT.

<question>

Do you need more information to answer this question? (Yes or No)


[LLM-generated yes/no]

Table 9: Prompt for MORE INFORMATION.

<question without multiple-choice options>

Proposed answer: [LLM-generated answer]

<options>
Does the proposed answer exist in the options?
[LLM-generated yes/no]

Table 10: Prompt for GENERATE AND MATCH.

Figure 10: Model performance in the without-retrieval, with-retrieval, and retrieval+abstain settings with the COMPETE approach.

Figure 11: Model performance in the multi-hop knowledge reasoning setting.

Figure 12: Performance of the COMPETE mechanism with Mistral-7B across the 57 subtasks in MMLU.

// obtain proposed answer

Question: <question>
Answer: [generated proposed answer]

// obtain feedback from self-specialized experts

for domain in ["factual information", "commonsense knowledge", "mathematical knowledge"]:


Generate some knowledge about the question, focusing on <domain>: [generated domain knowledge]

Knowledge: <generated domain knowledge>


Question: <question>
Answer: <generated proposed answer>
Please review the proposed answer and provide feedback on its correctness.
Feedback: [generated feedback]

// area-chairing for abstain decision

Question: <question>
Proposed Answer: <generated proposed answer>

Feedback 1: <generated feedback from expert 1>


...
Feedback k: <generated feedback from expert k>

Based on the feedback, the proposed answer is:


A. True
B. False
The answer is [A/B].

Table 11: Prompt for COOPERATE-self.

// obtain proposed answer

Question: <question>
Answer: [generated proposed answer]

// obtain feedback from other LLMs as reviewers


for llm in list_of_other_llms:

Question: <question>
Answer: <generated proposed answer>
Please review the proposed answer and provide feedback on its correctness.
Feedback: [generated feedback using llm]

// area-chairing for abstain decision

Question: <question>
Proposed Answer: <generated proposed answer>

Feedback 1: <generated feedback from llm 1>


...
Feedback k: <generated feedback from llm k>

Based on the feedback, the proposed answer is:


A. True
B. False
The answer is [A/B].

Table 12: Prompt for COOPERATE-others.
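For illustration only, a rough sketch of how the COOPERATE-others flow in Table 12 could be driven programmatically; generate(model, prompt) is an assumed text-completion helper, and the prompt strings are abbreviated rather than copied verbatim from the table.

def cooperate_others_abstain(question, answer_model, reviewer_models, generate):
    # step 1: obtain the proposed answer from the answering LLM
    proposed = generate(answer_model, f"Question: {question}\nAnswer:")
    # step 2: collect feedback on the proposed answer from each reviewer LLM
    feedbacks = [
        generate(
            reviewer,
            f"Question: {question}\nAnswer: {proposed}\n"
            "Please review the proposed answer and provide feedback on its correctness.\nFeedback:",
        )
        for reviewer in reviewer_models
    ]
    # step 3: the area-chair LLM aggregates the feedback into a True/False judgment
    feedback_block = "\n".join(f"Feedback {i + 1}: {f}" for i, f in enumerate(feedbacks))
    judgment = generate(
        answer_model,
        f"Question: {question}\nProposed Answer: {proposed}\n{feedback_block}\n"
        "Based on the feedback, the proposed answer is:\nA. True\nB. False\nThe answer is",
    )
    abstain = judgment.strip().startswith("B")  # abstain when the judge deems the answer false
    return proposed, abstain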

// obtain proposed answer

Question: <question>
Answer: [generated proposed answer]

// obtain alternative answers


if multiple-choice:
<alternative answer> = randomly select another unchosen answer
else:
Question: <question>
Answer: <generated proposed answer>
Please propose an alternative answer: [alternative answer]

// obtain a supporting paragraph for the alternative answer

Question: <question>
Generate a knowledge paragraph about <alternative answer>: [generated alternative passage]

// challenge!

Answer the question with the following knowledge: feel free to ignore irrelevant or wrong information.

Knowledge: <generated alternative passage>


Question: <question>
Answer: [new generated answer]

// abstain decision
if <new generated answer> == <generated proposed answer>:
abstain = False
else:
abstain = True

Table 13: Prompt for COMPETE.
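A rough sketch of the COMPETE flow in Table 13 for multiple-choice questions; generate(prompt) is an assumed helper and the prompts are abbreviated. Unlike the full method, which uses three alternative answers and averages token probabilities (Appendix B.3), this sketch uses a single alternative and a hard answer-change check.

import random

def compete_abstain(question, options, generate):
    # step 1: obtain the proposed answer
    proposed = generate(f"Question: {question}\nAnswer:").strip()
    # step 2: pick one alternative answer and generate a supporting knowledge paragraph for it
    alternative = random.choice([o for o in options if o != proposed])
    passage = generate(f"Question: {question}\nGenerate a knowledge paragraph about {alternative}:")
    # step 3: re-answer the question conditioned on the conflicting passage
    new_answer = generate(
        "Answer the question with the following knowledge: "
        "feel free to ignore irrelevant or wrong information.\n"
        f"Knowledge: {passage}\nQuestion: {question}\nAnswer:"
    ).strip()
    # step 4: abstain if the model was swayed away from its original answer
    abstain = new_answer != proposed
    return proposed, abstain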

Question: The following information pertains to Dash Co.’s utility bills: Period covered Amount Date paid April 16 – May
15 $5000 June 1 May 16 – June 15 $6000 July 1 June 16 – July 15 $8000 August 1 What is the amount that Dash should
report as a liability in its June 30 balance sheet?
A: $6,000
B: $7,000
C: $10,000
D: $14,000
Choose one answer from the above choices. The answer is B: $7,000

Feedback 1: The correct answer is actually A: $6,000. The amount that Dash should report as a liability in its June 30 balance
sheet is the utility bill for the period of May 16 - June 15, which is $6,000. The subsequent utility bills for June 16 - July 15
and July 16 - August 15 will be reported as liabilities in future balance sheets.

Feedback 2: The answer is incorrect. To determine the liability that Dash should report in its June 30 balance sheet, we need
to look at the amount that is unpaid as of June 30. According to the information given, the utility bill for April 16 - May 15
was paid on June 1. Therefore, Dash should report the unpaid amount for the period May 16 - June 15 as a liability on its
June 30 balance sheet. This amount is $6,000. Thus, the correct answer is A: $6,000.

Feedback 3: The proposed answer is incorrect.

Based on the feedback, the proposed answer is:


A. True
B. False
The answer is
B. False

Table 14: Example one for COOPERATE-self. The reviewers provided different feedback, while the area chair LLM merged them into a final abstain decision.

Question: Compared with the current in an LED lamp the current in the connecting wire is
A: less.
B: more.
C: the same.
D: Need more information.
Choose one answer from the above choices. The answer is A: less.

Feedback 1: The proposed answer, A: less, is incorrect. The answer C: the same, would be more appropriate. In an LED
lamp, the current flows from the power source through the connecting wire and into the LED. The current in the wire would
be the same as the current in the LED if there are no resistive elements in the wire causing a drop in voltage.

Feedback 2: The proposed answer is incorrect. The statement does not provide enough information to determine the current
in the connecting wire compared to the current in an LED lamp. Therefore, the correct answer is D: Need more information.

Feedback 3: The proposed answer is correct. In an LED lamp, the current in the connecting wire is typically less compared to
the current in the LED itself. This is because the LED lamp is designed to limit the amount of current flowing through the
LED to ensure proper operation and prevent damage. The connecting wire acts as a conductor to allow the flow of current
from the power source to the LED, but it does not have as high a current as the LED itself.

Based on the feedback, the proposed answer is:


A. True
B. False
The answer is
B. False.

Table 15: Example two for COOPERATE-self. The reviewers provided different feedback, while the area chair LLM merged them into a final abstain decision.

Question: The following information pertains to Dash Co.’s utility bills: Period covered Amount Date paid April 16 – May
15 $5000 June 1 May 16 – June 15 $6000 July 1 June 16 – July 15 $8000 August 1 What is the amount that Dash should
report as a liability in its June 30 balance sheet?
A: $6,000
B: $7,000
C: $10,000
D: $14,000
Choose one answer from the above choices. The answer is B: $7,000

Feedback 1: The answer provided is not correct. The correct amount that Dash should report as a liability in its June 30
balance sheet is $6,000. This is because the amount for the period of May 16 - June 15 has already been paid on June 1, and
does not need to be reported as a liability. The only amount that is still outstanding and needs to be reported is the amount for
the period of June 16 - July 15, which is $6,000. Therefore, the correct answer is A: $6,000.

Feedback 2: The proposed answer is incorrect. Dash should report $7,000 as a liability in its June 30 balance sheet, but not
$6,000.

Here’s the reasoning:

1. The period covered by the utility bills is May 16 – June 15.
2. The amount due is $6,000, and Dash paid $5,000 on May 16 – May 15 and $1,000 on June 16 – June 15.
3. The date paid for the $6,000 bill is June 16 – June 15 and the date paid for the $1,000 bill is June 16 – June 15.
4. Since the payment of the $1,000 bill has already

Feedback 3: The correct answer is indeed B: $7,000.


The explanation is as follows:

The June 30 balance sheet should report a liability for the utility bills that have been incurred but not yet

Based on the feedback, the proposed answer is:


A. True
B. False
The answer is
B. False.

Table 16: Example one for COOPERATE-others. The reviewers provided different feedback, while the area chair LLM merged them into a final abstain decision.

Question: Who kills Tony at the end of the film ’West Side Story’?
A: Riff
B: Chino
C: Bernardo
D: He kills himself
Choose one answer from the above choices. The answer is B: Chino

Feedback 1: The proposed answer is correct. Chino is the one who kills Tony at the end of the film ’West Side Story’.

Feedback 2: The answer provided is incorrect. In the film ’West Side Story’, it is Bernardo who kills Tony. Thank you for
providing the correct answer.

Feedback 3: The proposed answer is incorrect. Tony is killed by Bernardo, not Chino. The correct answer is C: Bernardo.

New proposed answer:


Answer: Who kills Tony at the end of the film ’West Side Story’

Based on the feedback, the proposed answer is:


A. True
B. False
The answer is
B. False. The correct answer is C: Bernardo.

Table 17: Example two for COOPERATE-others. The reviewers provided different feedback, while the area chair LLM merged them into a final abstain decision.

Question: The following information pertains to Dash Co.’s utility bills: Period covered Amount Date paid April 16 – May
15 $5000 June 1 May 16 – June 15 $6000 July 1 June 16 – July 15 $8000 August 1 What is the amount that Dash should
report as a liability in its June 30 balance sheet?
A: $6,000
B: $7,000
C: $10,000
D: $14,000
Generate a knowledge paragraph about A.
Dash Co. should report $6,000 as a liability in its June 30 balance sheet. This is because the utility bills for the period from
May 16 to June 15 amount to $6,000. The balance sheet is a financial statement that provides an overview of a company’s
financial position at a specific point in time. Liabilities are obligations or debts owed by a company, which are classified as
current liabilities if they are expected to be settled within one year. In this case, the utility bill for the period ending on June
15 falls within the one-year timeframe and should therefore be reported as a current liability. Therefore, the correct answer is
A: $6,000.
New answer: A: $6,000.

Table 18: Example one for COMPETE. The LLM is impacted by the alternative answer and should abstain.

Question: A flashing red traffic light signifies that a driver should do what?
A: stop
B: speed up
C: proceed with caution
D: honk the horn
Generate a knowledge paragraph about D.
D: Honking the horn is the appropriate action when encountering a flashing red traffic light. This is an important traffic rule
that drivers must follow to maintain safety on the roads. When encountering a flashing red light, it is crucial for all drivers to
be aware of traffic signals and adhere to their instructions to ensure a smooth flow of traffic and prevent any potential hazards.
Honking the horn, on the other hand, is generally used as a way to alert other drivers or pedestrians of potential dangers or to
communicate with them in certain situations.
New answer: A: stop.

Table 19: Example two for COMPETE. The LLM is not impacted by the alternative answer.

Question: Kelsen’s theory of law is called pure theory because Kelsen:


A: Purely discussed jurisprudence only
B: Defined law in according with morality and purity
C: Separated law from religion, ethics, sociology and history
D: Discussed law purely in terms of justice
Generate a knowledge paragraph about D.
Kelsen’s theory of law is called pure theory because he discussed law purely in terms of justice. According to Kelsen, law
should be analyzed and understood independently of any moral or ethical considerations. He believed that law should be
studied and interpreted solely through its internal system of norms, without any reference to external moral values or social
contexts.

Kelsen’s pure theory of law aimed to provide a "science of law" by separating law from other disciplines such as religion,
ethics, sociology, and history. He argued that law should be treated as a self-contained system, with its own hierarchy of
norms and principles. This meant that the validity of laws should be determined based on their consistency with higher-level
norms, rather than any inherent moral or ethical qualities.

By focusing purely on the concept of justice, Kelsen attempted to provide an objective and neutral framework for understanding
and analyzing legal systems. He believed that by divorcing law from subjective moral judgments, his theory would provide a
more rigorous and scientific approach to the
New answer: D: Discussed law purely in terms of justice.

Table 20: Example three for COMPETE. The LLM is impacted by the alternative answer and should abstain.

Question: Suppose you live on the Moon. How long is a day (i.e. from sunrise to sunrise)?
A: about 18 years
B: 24 hours
C: 29 Earth days
D: a year
Generate a knowledge paragraph about A.
Option A: about 18 years

If you were to live on the Moon, a day would last about 18 years. This is due to a phenomenon called tidal locking. Tidal
locking occurs when the gravitational force between two objects, such as the Earth and the Moon, causes one object to always
face the other with the same side. In the case of the Moon, it is tidally locked to the Earth, meaning one side is constantly
facing our planet.

This tidal locking also means that the Moon’s rotation period matches its orbital period around the Earth, taking approximately
27.3 Earth days to complete one full revolution. However, despite this prolonged rotation period, it is still necessary to define
a day from a sunrise to a sunrise, just as we do on Earth.

Since the Moon does not have an atmosphere to scatter sunlight and bring about the distinct sunrise and sunset we are
accustomed to, the definition of a lunar day can be somewhat arbitrary. One way to measure it is based on
New answer: B: 24 hours.

Table 21: Example four for COMPETE. The LLM is not impacted by the alternative answer.

