Figure 1: Overview of collaboration-based approaches for LLM abstention: COOPERATE and COMPETE.
lenges such as confirmation biases and hallucination would result in unreliable self-evaluation (Kadavath et al., 2022; Ji et al., 2023; Xie et al., 2023). In addition, the stronger approaches rely on held-out sets and result in weaker generalization across knowledge domains. We propose to enable robust LLM abstention through multi-LLM collaboration, i.e., employing multiple LLMs that interact with each other to evaluate the veracity of generated texts and make abstain decisions (Figure 1).

Cooperate LLMs often come with varying knowledge coverage with the potential of complementing each other (Yu et al., 2023a; Du et al., 2023; Bansal et al., 2023), while the internal knowledge of one LLM could also be diversified and specialized (Kang et al., 2023; Si et al., 2023). We tap into this knowledge variation in the COOPERATE approach by having different expert LLMs generate feedback on LLM-proposed answers and employing an LLM as the final judge to decide

We then instruct the LLM to answer q again with conflicting information prepended: ã_i = LLM(knowledge_i, q). If a = ã_i, the LLM sticks to the original answer and should thus be accepted; otherwise, the LLM is swayed by conflicting information generated by other LLMs, betraying its low confidence. This process is repeated for k alternative answers, and the LLM should abstain if the answer changes in a majority of cases.

3 Experiment Settings

Models We evaluate LLM abstention baselines and our proposed collaboration-based approaches with three LLMs featuring varying sizes and openness: Mistral-7B (Jiang et al., 2023), LLaMA2-70B (Touvron et al., 2023), and ChatGPT. We posit that a robust abstain mechanism should ideally work for all LLMs, weak and strong. We set the default sampling temperature to 0.1, and employ 0.7 where multiple runs are required.
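To make the COMPETE decision rule described above concrete, here is a minimal Python sketch. The helpers llm and propose_alternative_answers are hypothetical placeholders, and the exact prompt wording and number of alternatives k follow the description only loosely; this is a sketch, not the paper's implementation.

```python
# Minimal sketch of the COMPETE abstain decision (hypothetical helpers; prompts,
# k, and the LLM backend are assumptions, not the paper's released code).
def llm(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM (e.g., Mistral-7B or ChatGPT)."""
    raise NotImplementedError

def propose_alternative_answers(question: str, k: int) -> list[str]:
    """Hypothetical helper: propose k alternative answers (e.g., other options)."""
    raise NotImplementedError

def compete_abstain(question: str, proposed_answer: str, k: int = 3) -> bool:
    alternatives = [a for a in propose_alternative_answers(question, k)
                    if a != proposed_answer]
    flips = 0
    for alt in alternatives:
        # 1) Ask the LLM to write a knowledge paragraph supporting the alternative.
        knowledge = llm(f"Generate a knowledge paragraph about {alt}.")
        # 2) Re-ask the question with the conflicting knowledge prepended.
        new_answer = llm(
            "Answer the question with the following knowledge; feel free to "
            f"ignore irrelevant or wrong information.\n{knowledge}\n{question}"
        )
        # 3) A changed answer signals low confidence in the original answer.
        if new_answer.strip() != proposed_answer.strip():
            flips += 1
    # Abstain if the answer changes in a majority of the challenges.
    return flips > len(alternatives) / 2
```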
MMLU K-Crosswords Hellaswag Propaganda
Method R-Acc ER A-Acc A-F1 R-Acc ER A-Acc A-F1 R-Acc ER A-Acc A-F1 R-Acc ER A-Acc A-F1
MISTRAL-7B
P ROBS .570 .109 .608 .456 .251 -.351 .397 .422 .456 -.041 .599 .659 .337 -.150 .590 .680
T EMP. .565 .104 .601 .426 .250 -.452 .303 .182 .451 -.047 .601 .659 .340 -.170 .585 .661
A SK C ALI . .648 .141 .639 .647 .233 -.023 .705 .825 .455 -.017 .616 .734 .231 -.035 .705 .824
H IDDEN .449 -.085 .420 .137 .104 -.788 .107 .009 .369 -.198 .424 .336 .223 -.510 .240 .084
V ERIFIER .568 .083 .586 .534 .208 -.083 .805 .889 .405 -.080 .550 .628 .448 -.015 .760 .853
I NSTRUCT .709 .198 .688 .693 - - - - .616 .075 .701 .771 - - - -
R EFLECT .498 -.002 .495 .477 .111 -.686 .203 .207 .371 -.150 .477 .500 .216 -.275 .490 .602
M ORE I NFO .504 .006 .500 .216 .272 -.455 .273 .003 .379 -.227 .396 .120 .219 -.540 .240 .073
G EN +M ATCH .511 .021 .515 .129 .111 -.694 .193 .188 .377 -.174 .458 .415 .226 -.545 .230 .013
NOTA .516 .029 .528 .163 .098 -.797 .102 .011 .371 -.244 .387 .105 .259 -.410 .340 .267
SC T HRES . .604 .140 .641 .551 .210 -.090 .793 .880 .493 -.004 .614 .713 .273 -.100 .685 .799
C OOP - SELF .571 .059 .564 .601 .260 -.437 .313 .311 .406 -.042 .601 .719 .297 -.095 .680 .797
C OOP - OTHERS .688 .213 .712 .692 .266 -.022 .761 .852 .626 .092 .725 .783 .182 -.140 .625 .757
C OMPETE .735 .140 .640 .700 .289 -.129 .597 .722 .573 .032 .658 .766 .302 -.055 .700 .805
LLAMA2-70B
A SK C ALI . .624 .025 .435 .568 - - - - .944 .032 .498 .649 - - - -
H IDDEN .473 -.018 .400 .446 .282 -.265 .423 .466 .507 .007 .497 .479 - - - -
V ERIFIER .665 .201 .609 .511 .443 -.056 .634 .694 .522 .009 .504 .617 .259 -.065 .755 .855
I NSTRUCT .745 .216 .628 .640 .288 .024 .606 .772 .475 -.008 .487 .615 - - - -
R EFLECT .616 .121 .529 .469 .383 -.155 .528 .537 .569 .076 .560 .530 .198 -.520 .275 .225
M ORE I NFO .590 .177 .590 .042 .306 -.382 .308 .020 .516 .032 .518 .044 - - - -
G EN +M ATCH .667 .050 .450 .560 .248 -.111 .573 .708 .484 -.004 .477 .614 .082 -.205 .620 .759
NOTA .592 .167 .583 .181 .323 -.295 .388 .280 .516 .027 .522 .236 .185 -.580 .225 .124
SC T HRES . .684 .247 .656 .534 .426 -.090 .590 .617 .667 .100 .590 .655 .412 -.030 .760 .852
C OOP - SELF .615 .150 .550 .400 .463 -.030 .640 .714 .649 .110 .600 .643 .222 -.250 .500 .615
C OOP - OTHERS .694 .262 .676 .562 .402 -.063 .636 .757 .700 .238 .704 .677 .329 -.125 .675 .774
C OMPETE .782 .148 .552 .608 .323 -.080 .642 .760 .611 .047 .525 .625 .161 -.210 .595 .729
CHATGPT
P ROBS .774 .421 .715 .457 .600 .187 .587 .122 .750 .278 .599 .476 .333 -.015 .625 .765
T EMP. .769 .419 .716 .452 .616 .214 .619 .216 .750 .278 .595 .468 .250 -.010 .630 .772
A SK C ALI . .694 .385 .690 .006 .601 .202 .601 .010 .672 .344 .672 .006 .444 -.015 .580 .712
V ERIFIER .788 .301 .599 .483 - - - - .667 .305 .634 .120 .441 -.035 .570 .672
I NSTRUCT .840 .422 .748 .579 .752 .310 .709 .627 .817 .429 .751 .614 .605 .045 .645 .744
R EFLECT .752 .336 .630 .411 .784 .239 .641 .633 .754 .377 .701 .487 .571 .015 .615 .742
M ORE I NFO .721 .246 .546 .390 .605 .145 .553 .380 .675 .224 .548 .339 .416 -.145 .470 .293
G EN +M ATCH .737 .350 .652 .383 .660 .083 .486 .550 .712 .182 .506 .447 .365 -.115 .490 .568
NOTA .719 .389 .692 .260 .644 .163 .565 .480 .689 .307 .628 .268 .400 -.120 .485 .488
SC T HRES . .766 .424 .743 .447 .637 .216 .622 .382 .749 .366 .688 .468 .399 -.160 .440 .309
C OOP - SELF .841 .436 .726 .578 .794 .175 .646 .646 .878 .344 .670 .628 .684 .070 .710 .802
C OOP - OTHERS .780 .362 .660 .479 .659 .109 .509 .536 .790 .350 .676 .565 .790 .321 .647 .543
C OMPETE .947 .306 .602 .583 .875 .034 .441 .589 .939 .172 .490 .545 .611 .040 .670 .795
Table 1: Performance of abstain strategies on four datasets and three LLMs. Best results in bold and second best underlined. Approaches are color-coded per category: calibration, training, prompting, consistency, and collaboration. Certain incompatible cases, e.g., EMBEDDING with the black-box CHATGPT, are omitted. "-" indicates that an approach fails to produce meaningful abstain decisions (e.g., it almost always abstains or does not follow instructions). COOPERATE and COMPETE achieve the best performance in 9 of the 12 settings in terms of reliable accuracy.
Tasks and Datasets We evaluate LLM abstention with four datasets spanning diverse knowledge domains and reasoning scenarios: 1) MMLU (Hendrycks et al., 2020) is a multiple-choice dataset for general knowledge QA; 2) Knowledge Crosswords (Ding et al., 2023) is a structured QA dataset that focuses on multi-hop and compositional knowledge reasoning; 3) Hellaswag (Zellers et al., 2019) is a natural language inference dataset that tests commonsense knowledge and reasoning; 4) Propaganda (Piskorski et al., 2023) tasks LLMs with identifying the 23 propaganda tactics in a long news article based on their internal knowledge. The datasets are all in English. We create splits of held-out sets and test sets for the four datasets with details in Appendix B.1. We employ LLMs to answer these questions with zero-shot prompting and greedy decoding to obtain the ground truth of whether the LLM "knows" the answer and whether it should abstain. We posit that a robust abstain mechanism should work for knowledge tasks across domains and reasoning contexts.
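As a concrete illustration of how the abstain ground truth described above can be derived, here is a minimal sketch; the Example container, the llm_answer helper, and the exact answer-matching rule are assumptions and not taken from the paper.

```python
# Sketch: derive "should abstain" labels by answering each question zero-shot
# with greedy decoding and comparing against the gold answer (assumed helpers).
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    options: list[str]
    gold: str  # gold option letter, e.g. "B"

def llm_answer(question: str, options: list[str]) -> str:
    """Hypothetical zero-shot, greedy-decoding call returning an option letter."""
    raise NotImplementedError

def label_should_abstain(examples: list[Example]) -> list[bool]:
    labels = []
    for ex in examples:
        predicted = llm_answer(ex.question, ex.options)
        # The LLM "knows" the answer iff its greedy answer matches the gold label;
        # otherwise the ideal abstain decision for this question is True.
        labels.append(predicted.strip().upper() != ex.gold.strip().upper())
    return labels
```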
Evaluation Metrics We illustrate the four scenarios under AbstainQA in Figure 2 and employ the following evaluation metrics. (1) Reliable Accuracy (R-Acc): A/(A+C) indicates to what extent LLM-generated answers (not abstained) could be trusted, i.e., out of all questions answered, how many are correct? (2) Effective Reliability (ER) (Whitehead et al., 2022; Si et al., 2023): (A-C)/(A+B+C+D) strikes a balance between reliability and coverage, i.e., out of all questions, what proportion more are answered correctly than incorrectly? (3) Abstain Accuracy (A-Acc): (A+D)/(A+B+C+D) evaluates whether the abstain decisions are correct: LLMs should abstain when they would provide an incorrect answer and vice versa. (4) Abstain F1 (A-F1): harmonic-mean(precision, recall), where precision = D/(B+D) and recall = D/(C+D), a balanced metric between reliability and answer coverage.

Figure 2: Four outcomes of AbstainQA.
            Correct   Incorrect
Answered       A          C
Abstained      B          D

4 Results

We present the performance of abstain mechanisms with three LLMs on four tasks in Table 1.

COOPERATE and COMPETE are state-of-the-art abstain mechanisms. Our proposed collaboration-based approaches outperform the strongest baseline in 9 out of the 12 settings (3 models and 4 datasets), achieving an average improvement of 10.1% on reliable accuracy scores. We find that COOPERATE works better with stronger models such as ChatGPT: we hypothesize that the complexities of the multi-step feedback generation warrant a stronger base LLM. COMPETE emphasizes reliability and greatly avoids wrong answers, evident in the high reliable accuracy scores: we observe that LLMs stick to the original answers only in the most confident scenarios, making COMPETE ideal for abstention when reliability is paramount.

Calibration > Training > Consistency > Prompting On average, the four types of approaches achieve 0.595, 0.576, 0.553, and 0.475 A-Acc scores. While simple prompting-based approaches struggle to elicit self-reflection, our proposed COOPERATE and COMPETE promote collaborative abstention and greatly improve performance. Among all baselines, instruction tuning and the self-consistency threshold stand out as two strong approaches; however, they both rely on a held-out set for training and hyperparameter tuning: we further investigate their generalizability in Section 5.

Abstain capabilities potentially correlate with base LLM utility. Abstain accuracy (A-Acc) is a metric independent of an LLM's underlying performance on a QA dataset and solely evaluates abstention quality. For A-Acc, Mistral-7B, LLaMA2-70B, and ChatGPT achieve 0.524, 0.537, and 0.616 on average across approaches and datasets, which aligns well with their relative superiority on standard benchmarks. In addition, our proposed collaboration-based approaches also work better with the strongest ChatGPT, while also improving the abstention performance of weaker LLMs over baselines.

Going beyond generic QA sheds light on the limitations of existing approaches. While most baselines function properly on MMLU, some collapse ("-" in Table 1) in other knowledge domains or reasoning contexts: ASK CALIBRATE, among others, struggles to follow instructions when the prompt context is long in detecting propaganda tactics; INSTRUCTION TUNING collapses and produces 100% abstention due to imbalances in the held-out set on K-Crosswords, where problems are much harder and the LLM should mostly abstain; calibration approaches struggle to set a good threshold when the model's base performance is too low. However, our proposed collaboration-based approaches work well on all four tasks, especially with the strongest ChatGPT language model, indicating their robustness and broad compatibility for knowledge-intensive tasks across domains and reasoning contexts.

5 Analysis

Abstain Absolute Aside from QA datasets where abstain decisions should be made based on the potential correctness of answers, we investigate scenarios where LLMs should 100% abstain. Specifically, we employ AmbigQA (Min et al., 2020), where LLMs should always abstain from ambiguous and underspecified questions. We also curate ElectionQA23, a QA dataset focusing on elections in 2023 around the globe (details in Appendix B.1), where LLMs should abstain due to the temporal mismatch of the training data cutoff, i.e., if the LLM has only seen training data before 2023. Other potential abstain absolute scenarios include known unknowns (Amayuelas et al., 2023) and questions with false/underspecified premises (Patidar et al., 2023; Hu et al., 2023). Figure 3 demonstrates that our proposed COMPETE consistently abstains the most in abstain absolute scenarios across three LLMs and two datasets: we observe that the alternative answers proposed in COMPETE often provide contradictory contexts, thus the LLMs suc-
[Figure residue: abstain rate (%) on AmbigQA with Mistral-7B, LLaMA2-70B, and ChatGPT; a table header "Method Ret. R-Acc ER A-Acc A-F1"; individual values not recoverable.]
Figure 7: Abstain precision and recall of approaches on MMLU with Mistral-7B. While most approaches feature higher precision, COMPETE achieves the best abstention recall and could be employed where reliability is paramount.
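For reference, the evaluation metrics of Section 3 and the abstain precision/recall reported in Figure 7 can all be computed from the four outcome counts of Figure 2; the sketch below uses our own function and variable names, not the paper's code.

```python
# A, B, C, D follow Figure 2: A = answered and correct, B = abstained although the
# answer would have been correct, C = answered and incorrect, D = abstained and
# the answer would have been incorrect.
def abstain_metrics(A: int, B: int, C: int, D: int) -> dict[str, float]:
    total = A + B + C + D
    precision = D / (B + D) if (B + D) else 0.0   # abstain precision
    recall = D / (C + D) if (C + D) else 0.0      # abstain recall
    return {
        "R-Acc": A / (A + C) if (A + C) else 0.0,        # reliable accuracy
        "ER": (A - C) / total,                           # effective reliability
        "A-Acc": (A + D) / total,                        # abstain accuracy
        "A-F1": (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0),      # abstain F1
        "abstain-precision": precision,
        "abstain-recall": recall,
    }
```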
LLMs decide whether to abstain or not given a question, calibration-based approaches would rely on a held-out set to set a threshold over calibrated confidence scores: such a threshold and reliance on a held-out set could jeopardize the generalization of calibration-based approaches across knowledge domains and reasoning contexts.

Prompting-based approaches leverage the instruction-following abilities of LLMs and employ instructions to induce self-reflection and gauge whether the generated answer should be trusted. Kadavath et al. (2022) investigates whether a "none-of-the-above" option or self-evaluation prompting would induce good estimations of LLMs' internal factuality. Huang et al. (2023a) follows existing works (Kim et al., 2024; Shinn et al., 2023) to evaluate whether LLMs could self-correct their own reasoning with a three-step prompting strategy. Feng et al. (2023b) prompts the LLM to reflect on whether more information is needed for a given question to selectively solicit external knowledge. These approaches, among other prompting-based methods for acknowledging knowledge limitations (Wang et al., 2023a; Si et al., 2023), mostly rely on the hope that LLMs could indeed carry out self-reflection through simple prompting, while their ability on this front might be subject to confounders and spurious correlations (Kadavath et al., 2022).

Training-based approaches focus on empowering LLMs with the ability to acknowledge uncertainty or abstain through training with special data or objectives. Cobbe et al. (2021) proposes to train an external verifier to evaluate the correctness of LLM-generated answers in math reasoning. A series of works also attempt to probe the hidden representations of LLMs for signals of factuality and hallucination (Slobodkin et al., 2023; Azaria and Mitchell, 2023). Contemporary works have gained interest in improving LLMs' factuality and self-knowledge through instruction tuning and alignment (Zhang et al., 2023; Yang et al., 2023; Sun et al., 2023; Bashlovkina et al., 2023), as well as promoting abstention from a safety perspective (Huang et al., 2023b; Liu et al., 2023b). While training-based approaches have shown great potential, they would often require substantial computational resources to fine-tune LLMs and might struggle to generalize across knowledge domains and contexts due to their reliance on the patterns and examples in the training subset.

In this work, we propose COOPERATE and COMPETE, two collaboration-based approaches to enable robust LLM abstention through multi-LLM collaboration. These approaches have shown great potential in improving LLM abstention in not only direct QA, but also retrieval-augmented QA and multi-hop reasoning scenarios.

7 Conclusion

We investigate AbstainQA, a setting where LLMs should abstain from answering questions incorrectly. We curate a taxonomy of 11 abstain baselines across four categories and propose COOPERATE and COMPETE, two novel abstain mechanisms that promote mechanistic reflection through multi-LLM collaboration, in cooperation or competition. Extensive experiments on four datasets demonstrate that COOPERATE and COMPETE advance the state-of-the-art in AbstainQA, with the potential to improve retrieval-augmented LLMs, multi-hop knowledge reasoning, and more.

Limitations

While we primarily investigated and evaluated LLM abstention from a knowledge perspective, the problem of abstaining is also important from a safety perspective, preventing LLMs from generating harmful content or biased responses (Huang et al., 2023b; Liu et al., 2023b). There are also intersections between the knowledge and safety aspects of abstention, such as abstaining from underspecified questions based on stereotypes of demographic attributes. We plan to expand on this work and
propose unified abstain strategies for knowledge and safety scenarios in the future.

The abstain mechanisms and experiments in this work are exhaustive to the extent that our computational budget permits, while we could not cover every single approach and proposal from existing works. Nevertheless, we will make the evaluation data, code, and framework publicly available so that new abstain approaches, as well as datasets, can be seamlessly added to advance LLM abstention research.

Our proposed collaboration-based approaches require prompting multiple LLMs for feedback and could have more computational overhead, while the exact cost would depend on the inference costs of the other LLMs. We provide a summary of inference costs in Table 6 and Appendix A. We posit that the value of multi-LLM collaboration for abstention lies in using smaller models to supervise larger models in judging veracity and preventing hallucinations. When we serve a user-facing LLM with hundreds of billions of parameters, our approach enables the incorporation of a few 7B LLMs for abstention guidance and hallucination mitigation: having a few extra 7B models does not add greatly to the cost of a hundreds-of-billions LLM, while significantly boosting its factuality and reliability. By sampling multiple feedbacks to make abstain decisions, randomness is also introduced and there might be variation across model inference runs.

In this work, we focus on a conceptual "abstain" functionality and develop robust approaches to identify the knowledge gaps in large language models. However, "abstain" doesn't necessarily have to be part of the user interface of LLMs: it could be an internal component of an LLM-based system, followed by approaches to find better knowledge sources to fill the knowledge gaps or methods to incorporate the uncertainty into model responses to enhance trustworthiness (Zhou et al., 2024). Future work could further explore LLM pipelines where these abstain approaches are one component towards reliable question answering.

We acknowledge the potential impact of test set pollution in the training of the adopted LLMs. While we do not know the exact training composition of ChatGPT, and the other two LLMs (Mistral-7B, LLaMA2-70B) are not explicitly trained on test sets, they might still be unintentionally included in the training data. Our results and findings should thus be interpreted with this limitation in mind.

Ethics Statement

Figure 5 highlights that LLM abstain decisions could also have fairness implications, as they fail to abstain and assume the outcomes of future elections more frequently for African and Asian countries. We similarly expect varying levels of abstain performance on questions and prompts regarding different demographics, communities, and perspectives, potentially due to LLMs' internal social biases (Blodgett et al., 2020; Jin et al., 2021; Bender et al., 2021; Shaikh et al., 2023; Feng et al., 2023a). We plan to examine the intersections of LLM social biases and their abstention abilities in future work, especially in critical domains such as hate speech and misinformation where failing to abstain from generating problematic content could have grave consequences for marginalized communities.

Acknowledgements

This material is based upon work supported by the National Science Foundation under CAREER Grant No. IIS2142739, NSF Grant No. IIS2203097, and the DARPA CMO under Contract No. HR001120C0124. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies. We also gratefully acknowledge support from an Alfred P. Sloan Foundation Fellowship.

References

Alfonso Amayuelas, Liangming Pan, Wenhu Chen, and William Wang. 2023. Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models. arXiv preprint arXiv:2305.13712.

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.

Amos Azaria and Tom Mitchell. 2023. The internal state of an llm knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976.

Jinheon Baek, Soyeong Jeong, Minki Kang, Jong C Park, and Sung Hwang. 2023. Knowledge-augmented language model verification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1720–1736.

Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, and Partha Talukdar. 2023. Llm augmented
llms: Expanding capabilities through composition. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian,
In The Twelfth International Conference on Learning Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias
Representations. Plappert, Jerry Tworek, Jacob Hilton, Reiichiro
Nakano, Christopher Hesse, and John Schulman.
Vasilisa Bashlovkina, Zhaobin Kuang, Riley Matthews, 2021. Training verifiers to solve math word prob-
Edward Clifford, Yennie Jun, William W Cohen, and lems. arXiv preprint arXiv:2110.14168.
Simon Baumgartner. 2023. Trusted source align-
ment in large language models. arXiv preprint Roi Cohen, Mor Geva, Jonathan Berant, and Amir
arXiv:2311.06697. Globerson. 2023. Crawling the internal knowledge-
base of language models. In Findings of the Asso-
Emily M Bender, Timnit Gebru, Angelina McMillan- ciation for Computational Linguistics: EACL 2023,
Major, and Shmargaret Shmitchell. 2021. On the pages 1811–1824.
dangers of stochastic parrots: Can language models
be too big? In Proceedings of the 2021 ACM confer- Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Edit-
ence on fairness, accountability, and transparency, ing factual knowledge in language models. In Pro-
pages 610–623. ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 6491–
Su Lin Blodgett, Solon Barocas, Hal Daumé III, and 6506.
Hanna Wallach. 2020. Language (technology) is
power: A critical survey of “bias” in NLP. In Pro- Shrey Desai and Greg Durrett. 2020. Calibration of
ceedings of the 58th Annual Meeting of the Asso- pre-trained transformers. In Proceedings of the 2020
ciation for Computational Linguistics, pages 5454– Conference on Empirical Methods in Natural Lan-
5476, Online. Association for Computational Lin- guage Processing (EMNLP), pages 295–302, Online.
guistics. Association for Computational Linguistics.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoff- Wenxuan Ding, Shangbin Feng, Yuhan Liu, Zhaoxuan
mann, Trevor Cai, Eliza Rutherford, Katie Milli- Tan, Vidhisha Balachandran, Tianxing He, and Yulia
can, George Bm Van Den Driessche, Jean-Baptiste Tsvetkov. 2023. Knowledge crosswords: Geomet-
Lespiau, Bogdan Damoc, Aidan Clark, Diego ric reasoning over structured knowledge with large
De Las Casas, Aurelia Guy, Jacob Menick, Roman language models. arXiv preprint arXiv:2310.01290.
Ring, Tom Hennigan, Saffron Huang, Loren Mag-
giore, Chris Jones, Albin Cassirer, Andy Brock, Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenen-
Michela Paganini, Geoffrey Irving, Oriol Vinyals, baum, and Igor Mordatch. 2023. Improving factual-
Simon Osindero, Karen Simonyan, Jack Rae, Erich ity and reasoning in language models through multia-
Elsen, and Laurent Sifre. 2022. Improving language gent debate. arXiv preprint arXiv:2305.14325.
models by retrieving from trillions of tokens. In
Proceedings of the 39th International Conference Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia
on Machine Learning, volume 162 of Proceedings Tsvetkov. 2023a. From pretraining data to language
of Machine Learning Research, pages 2206–2240. models to downstream tasks: Tracking the trails of
PMLR. political biases leading to unfair NLP models. In
Proceedings of the 61st Annual Meeting of the As-
Tom Brown, Benjamin Mann, Nick Ryder, Melanie sociation for Computational Linguistics (Volume 1:
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Long Papers), pages 11737–11762, Toronto, Canada.
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Association for Computational Linguistics.
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Shangbin Feng, Weijia Shi, Yuyang Bai, Vidhisha Bal-
Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens achandran, Tianxing He, and Yulia Tsvetkov. 2023b.
Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- Knowledge card: Filling llms’ knowledge gaps with
teusz Litwin, Scott Gray, Benjamin Chess, Jack plug-in specialized language models. In The Twelfth
Clark, Christopher Berner, Sam McCandlish, Alec International Conference on Learning Representa-
Radford, Ilya Sutskever, and Dario Amodei. 2020. tions.
Language models are few-shot learners. In Ad-
vances in Neural Information Processing Systems, Xingyu Fu, Sheng Zhang, Gukyeong Kwon, Pramu-
volume 33, pages 1877–1901. Curran Associates, ditha Perera, Henghui Zhu, Yuhao Zhang, Alexan-
Inc. der Hanbo Li, William Yang Wang, Zhiguo Wang,
Vittorio Castelli, Patrick Ng, Dan Roth, and Bing
Sky CH-Wang, Benjamin Van Durme, Jason Eisner, Xiang. 2023. Generate then select: Open-ended vi-
and Chris Kedzie. 2023. Do androids know they’re sual question answering guided by world knowledge.
only dreaming of electric sheep? arXiv preprint In Findings of the Association for Computational
arXiv:2312.17249. Linguistics: ACL 2023, pages 2333–2346, Toronto,
Canada. Association for Computational Linguistics.
Tong Chen, Hongwei Wang, Sihao Chen, Wenhao
Yu, Kaixin Ma, Xinran Zhao, Dong Yu, and Hong- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Wein-
ming Zhang. 2023. Dense x retrieval: What re- berger. 2017. On calibration of modern neural net-
trieval granularity should we use? arXiv preprint works. In International conference on machine learn-
arXiv:2312.06648. ing, pages 1321–1330. PMLR.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu- Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham
pat, and Mingwei Chang. 2020. Retrieval augmented Neubig. 2021. How can we know when language
language model pre-training. In International confer- models know? on the calibration of language models
ence on machine learning, pages 3929–3938. PMLR. for question answering. Transactions of the Associa-
tion for Computational Linguistics, 9:962–977.
Guande He, Peng Cui, Jianfei Chen, Wenbo Hu, and Jun
Zhu. 2023. Investigating uncertainty calibration of Xisen Jin, Francesco Barbieri, Brendan Kennedy, Aida
aligned language models under the multiple-choice Mostafazadeh Davani, Leonardo Neves, and Xiang
setting. arXiv preprint arXiv:2310.11732. Ren. 2021. On transferability of bias mitigation ef-
fects in language model fine-tuning. In Proceedings
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, of the 2021 Conference of the North American Chap-
Mantas Mazeika, Dawn Song, and Jacob Steinhardt. ter of the Association for Computational Linguistics:
2020. Measuring massive multitask language under- Human Language Technologies, pages 3770–3783,
standing. In International Conference on Learning Online. Association for Computational Linguistics.
Representations.
Evan Hernandez, Belinda Z Li, and Jacob Andreas. Saurav Kadavath, Tom Conerly, Amanda Askell, T. J.
2023. Measuring and manipulating knowledge rep- Henighan, Dawn Drain, Ethan Perez, Nicholas
resentations in language models. arXiv preprint Schiefer, Zachary Dodds, Nova DasSarma, Eli Tran-
arXiv:2304.00740. Johnson, Scott Johnston, Sheer El-Showk, Andy
Jones, Nelson Elhage, Tristan Hume, Anna Chen,
Shengding Hu, Yifan Luo, Huadong Wang, Xingyi Yuntao Bai, Sam Bowman, Stanislav Fort, Deep
Cheng, Zhiyuan Liu, and Maosong Sun. 2023. Won’t Ganguli, Danny Hernandez, Josh Jacobson, John
get fooled again: Answering questions with false Kernion, Shauna Kravec, Liane Lovitt, Kamal
premises. In Proceedings of the 61st Annual Meet- Ndousse, Catherine Olsson, Sam Ringer, Dario
ing of the Association for Computational Linguistics Amodei, Tom B. Brown, Jack Clark, Nicholas Joseph,
(Volume 1: Long Papers), pages 5626–5643, Toronto, Benjamin Mann, Sam McCandlish, Christopher Olah,
Canada. Association for Computational Linguistics. and Jared Kaplan. 2022. Language models (mostly)
know what they know. ArXiv, abs/2207.05221.
Jie Huang, Xinyun Chen, Swaroop Mishra,
Huaixiu Steven Zheng, Adams Wei Yu, Xiny- Adam Tauman Kalai and Santosh S Vempala. 2023.
ing Song, and Denny Zhou. 2023a. Large language Calibrated language models must hallucinate. arXiv
models cannot self-correct reasoning yet. In The preprint arXiv:2311.14648.
Twelfth International Conference on Learning
Representations. Amita Kamath, Robin Jia, and Percy Liang. 2020. Se-
lective question answering under domain shift. In
Xiaowei Huang, Wenjie Ruan, Wei Huang, Gao Jin, Proceedings of the 58th Annual Meeting of the Asso-
Yizhen Dong, Changshun Wu, Saddek Bensalem, ciation for Computational Linguistics, pages 5684–
Ronghui Mu, Yi Qi, Xingyu Zhao, Kaiwen Cai, Yang- 5696, Online. Association for Computational Lin-
hao Zhang, Sihao Wu, Peipei Xu, Dengyu Wu, André guistics.
Freitas, and Mustafa A. Mustafa. 2023b. A survey of
safety and trustworthiness of large language models Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric
through the lens of verification and validation. ArXiv, Wallace, and Colin Raffel. 2023. Large language
abs/2305.11391. models struggle to learn long-tail knowledge. In In-
Abhyuday Jagannatha and Hong Yu. 2020. Calibrat- ternational Conference on Machine Learning, pages
ing structured output predictors for natural language 15696–15707. PMLR.
processing. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics, Junmo Kang, Hongyin Luo, Yada Zhu, James Glass,
pages 2078–2092, Online. Association for Computa- David Cox, Alan Ritter, Rogerio Feris, and Leonid
tional Linguistics. Karlinsky. 2023. Self-specialization: Uncovering
latent expertise within large language models. arXiv
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan preprint arXiv:2310.00160.
Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea
Madotto, and Pascale Fung. 2023. Survey of halluci- Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Akari
nation in natural language generation. ACM Comput- Asai, Xinyan Yu, Dragomir Radev, Noah A Smith,
ing Surveys, 55(12):1–38. Yejin Choi, Kentaro Inui, et al. 2024. Realtime qa:
What’s the answer right now? Advances in Neural
Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Information Processing Systems, 36.
Mensch, Chris Bamford, Devendra Singh Chap-
lot, Diego de Las Casas, Florian Bressand, Gi- Omar Khattab, Keshav Santhanam, Xiang Lisa
anna Lengyel, Guillaume Lample, Lucile Saulnier, Li, David Hall, Percy Liang, Christopher Potts,
L’elio Renard Lavaud, Marie-Anne Lachaux, Pierre and Matei Zaharia. 2022. Demonstrate-search-
Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, predict: Composing retrieval and language mod-
Timothée Lacroix, and William El Sayed. 2023. Mis- els for knowledge-intensive nlp. arXiv preprint
tral 7b. ArXiv, abs/2310.06825. arXiv:2212.14024.
Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Sewon Min, Julian Michael, Hannaneh Hajishirzi, and
2024. Language models can solve computer tasks. Luke Zettlemoyer. 2020. AmbigQA: Answering am-
Advances in Neural Information Processing Systems, biguous open-domain questions. In Proceedings of
36. the 2020 Conference on Empirical Methods in Natu-
ral Language Processing (EMNLP).
Lingkai Kong, Haoming Jiang, Yuchen Zhuang, Jie
Lyu, Tuo Zhao, and Chao Zhang. 2020. Cali- Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu,
brated language model fine-tuning for in- and out- Ouyang Long, Christina Kim, Christopher Hesse,
of-distribution data. In Proceedings of the 2020 Con- Shantanu Jain, Vineet Kosaraju, William Saunders,
ference on Empirical Methods in Natural Language Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen
Processing (EMNLP), pages 1326–1340, Online. As- Krueger, Kevin Button, Matthew Knight, Benjamin
sociation for Computational Linguistics. Chess, and John Schulman. 2021. Webgpt: Browser-
assisted question-answering with human feedback.
Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2022. ArXiv, abs/2112.09332.
Semantic uncertainty: Linguistic invariances for un-
certainty estimation in natural language generation. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida,
In The Eleventh International Conference on Learn- Carroll Wainwright, Pamela Mishkin, Chong Zhang,
ing Representations. Sandhini Agarwal, Katarina Slama, Alex Ray, et al.
2022. Training language models to follow instruc-
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris tions with human feedback. Advances in Neural
Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Information Processing Systems, 35:27730–27744.
Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Ku-
mar, et al. 2023. Holistic evaluation of language mod- Mayur Patidar, Prayushi Faldu, Avinash Singh,
els. Transactions on Machine Learning Research. Lovekesh Vig, Indrajit Bhattacharya, and Mausam.
2023. Do I have the knowledge to answer? inves-
Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. tigating answerability of knowledge base questions.
Teaching models to express their uncertainty in In Proceedings of the 61st Annual Meeting of the
words. arXiv preprint arXiv:2205.14334. Association for Computational Linguistics (Volume 1:
Long Papers), pages 10341–10357, Toronto, Canada.
Xin Liu, Muhammad Khalifa, and Lu Wang. 2023a. Association for Computational Linguistics.
Litcab: Lightweight calibration of language mod-
els on outputs of varied lengths. arXiv preprint Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
arXiv:2310.19208. Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and
Alexander Miller. 2019. Language models as knowl-
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying edge bases? In Proceedings of the 2019 Confer-
Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, ence on Empirical Methods in Natural Language Pro-
Muhammad Faaiz Taufiq, and Hang Li. 2023b. Trust- cessing and the 9th International Joint Conference
worthy llms: a survey and guideline for evaluating on Natural Language Processing (EMNLP-IJCNLP),
large language models’ alignment. In Socially Re- pages 2463–2473.
sponsible Language Modelling Research.
Jakub Piskorski, Nicolas Stefanovitch, Giovanni
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Da San Martino, and Preslav Nakov. 2023. Semeval-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, 2023 task 3: Detecting the category, the framing, and
Luke Zettlemoyer, and Veselin Stoyanov. 2019. the persuasion techniques in online news in a multi-
Roberta: A robustly optimized bert pretraining ap- lingual setup. In Proceedings of the the 17th Interna-
proach. arXiv preprint arXiv:1907.11692. tional Workshop on Semantic Evaluation (SemEval-
2023), pages 2343–2361.
Ziyang Luo, Can Xu, Pu Zhao, Xiubo Geng, Chongyang
Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt,
2023. Augmented large language models with Noah Smith, and Mike Lewis. 2023. Measuring and
parametric knowledge guiding. arXiv preprint narrowing the compositionality gap in language mod-
arXiv:2305.04757. els. In Findings of the Association for Computational
Linguistics: EMNLP 2023, pages 5687–5711, Singa-
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, pore. Association for Computational Linguistics.
Daniel Khashabi, and Hannaneh Hajishirzi. 2023.
When not to trust language models: Investigating Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
effectiveness of parametric and non-parametric mem- Dario Amodei, Ilya Sutskever, et al. 2019. Language
ories. In Proceedings of the 61st Annual Meeting of models are unsupervised multitask learners. OpenAI
the Association for Computational Linguistics (Vol- blog, 1(8):9.
ume 1: Long Papers), pages 9802–9822.
Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane
Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y- Suhr. 2023. Quantifying language models’ sensitiv-
Lan Boureau. 2022. Reducing conversational agents’ ity to spurious features in prompt design or: How i
overconfidence through linguistic calibration. Trans- learned to start worrying about prompt formatting.
actions of the Association for Computational Linguis- In The Twelfth International Conference on Learning
tics, 10:857–872. Representations.
Omar Shaikh, Hongxin Zhang, William Held, Michael and Christopher Manning. 2023. Just ask for cali-
Bernstein, and Diyi Yang. 2023. On second thought, bration: Strategies for eliciting calibrated confidence
let’s not think step by step! bias and toxicity in zero- scores from language models fine-tuned with human
shot reasoning. In Proceedings of the 61st Annual feedback. In Proceedings of the 2023 Conference
Meeting of the Association for Computational Lin- on Empirical Methods in Natural Language Process-
guistics (Volume 1: Long Papers), pages 4454–4470, ing, pages 5433–5442, Singapore. Association for
Toronto, Canada. Association for Computational Lin- Computational Linguistics.
guistics.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
Weijia Shi, Sewon Min, Michihiro Yasunaga, Min- bert, Amjad Almahairi, Yasmine Babaei, Nikolay
joon Seo, Rich James, Mike Lewis, Luke Zettle- Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
moyer, and Wen-tau Yih. 2023. Replug: Retrieval- Bhosale, et al. 2023. Llama 2: Open founda-
augmented black-box language models. arXiv tion and fine-tuned chat models. arXiv preprint
preprint arXiv:2301.12652. arXiv:2307.09288.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Boshi Wang, Xiang Yue, and Huan Sun. 2023a. Can
Karthik R Narasimhan, and Shunyu Yao. 2023. Re- ChatGPT defend its belief in truth? evaluating LLM
flexion: Language agents with verbal reinforcement reasoning via debate. In Findings of the Association
learning. In Thirty-seventh Conference on Neural for Computational Linguistics: EMNLP 2023, pages
Information Processing Systems. 11865–11881, Singapore. Association for Computa-
tional Linguistics.
Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer,
and Jordan Boyd-Graber. 2023. Getting more out Shuo Wang, Zhaopeng Tu, Shuming Shi, and Yang Liu.
of mixture of language model reasoning experts. In 2020. On the inference calibration of neural machine
Findings of the Association for Computational Lin- translation. In Proceedings of the 58th Annual Meet-
guistics: EMNLP 2023, pages 8234–8249. ing of the Association for Computational Linguistics,
pages 3070–3079, Online. Association for Computa-
Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido tional Linguistics.
Dagan, and Shauli Ravfogel. 2023. The curious case
of hallucinatory (un) answerability: Finding truths Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le,
in the hidden states of over-confident large language Ed H Chi, Sharan Narang, Aakanksha Chowdhery,
models. In Proceedings of the 2023 Conference on and Denny Zhou. 2022. Self-consistency improves
Empirical Methods in Natural Language Processing, chain of thought reasoning in language models. In
pages 3607–3625. The Eleventh International Conference on Learning
Representations.
Elias Stengel-Eskin and Benjamin Van Durme. 2023.
Calibrated interpretation: Confidence estimation in Yike Wang, Shangbin Feng, Heng Wang, Weijia
semantic parsing. Transactions of the Association for Shi, Vidhisha Balachandran, Tianxing He, and Yu-
Computational Linguistics, 11:1213–1231. lia Tsvetkov. 2023b. Resolving knowledge con-
flicts in large language models. arXiv preprint
Meiqi Sun, Wilson Yan, Pieter Abbeel, and Igor Mor- arXiv:2310.00935.
datch. 2022. Quantifying uncertainty in foundation
models via ensembles. In NeurIPS 2022 Workshop Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan
on Robustness in Sequence Modeling. Parvez, and Graham Neubig. 2023c. Learning to fil-
ter context for retrieval-augmented generation. arXiv
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, preprint arXiv:2311.08377.
Chunyuan Li, Yikang Shen, Chuang Gan, Liang-
Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. 2023. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Aligning large multimodal models with factually aug- Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
mented rlhf. arXiv preprint arXiv:2309.14525. et al. 2022. Chain-of-thought prompting elicits rea-
soning in large language models. Advances in Neural
Linwei Tao, Younan Zhu, Haolan Guo, Minjing Dong, Information Processing Systems, 35:24824–24837.
and Chang Xu. 2023. A benchmark study on cali-
bration. In The Twelfth International Conference on Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman,
Learning Representations. Linjie Li, Jena D Hwang, Liwei Jiang, Jillian Fisher,
Abhilasha Ravichander, Khyathi Chandu, et al. 2023.
Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, The generative ai paradox:“what it can create, it may
Odunayo Ogundepo, Ehsan Kamalloo, David not understand”. In The Twelfth International Con-
Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing ference on Learning Representations.
Chen, Mehdi Rezagholizadeh, et al. 2023. Nomiracl:
Knowing when you don’t know for robust multilin- Spencer Whitehead, Suzanne Petryk, Vedaad Shakib,
gual retrieval-augmented generation. arXiv preprint Joseph Gonzalez, Trevor Darrell, Anna Rohrbach,
arXiv:2312.11361. and Marcus Rohrbach. 2022. Reliable visual ques-
tion answering: Abstain rather than answer incor-
Katherine Tian, Eric Mitchell, Allan Zhou, Archit rectly. In European Conference on Computer Vision,
Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, pages 148–166. Springer.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, et al. 2019. Huggingface's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. 2023. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations.

Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. Recomp: Improving retrieval-augmented lms with compression and selective augmentation. arXiv preprint arXiv:2310.04408.

Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. 2023. Alignment for honesty. arXiv preprint arXiv:2312.07000.

Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, et al. 2023a. Kola: Carefully benchmarking world knowledge of large language models. In The Twelfth International Conference on Learning Representations.

Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. 2023b. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.

Hanning Zhang, Shizhe Diao, Yong Lin, Yi R Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2023. R-tuning: Teaching large language models to refuse unknown questions. arXiv preprint arXiv:2311.09677.

Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, and Subhrajit Roy. 2023a. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. arXiv preprint arXiv:2309.17249.

Kaitlyn Zhou, Jena D Hwang, Xiang Ren, and Maarten Sap. 2024. Relying on the unreliable: The impact of language models' reluctance to express uncertainty. arXiv preprint arXiv:2401.06730.

Kaitlyn Zhou, Dan Jurafsky, and Tatsunori Hashimoto. 2023b. Navigating the grey area: How expressions of uncertainty and overconfidence affect language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5506–5524, Singapore. Association for Computational Linguistics.

Figure 8: Performance of INSTRUCTION TUNING when trained with one dataset/LLM and tested on another. Training on a held-out set harms the generalization of abstain accuracy across tasks and LLMs.

                        fail in 1/3 hops   fail in 2/3 hops   fail in 3/3 hops
abstain in 1/3 hops           351                153                 12
abstain in 2/3 hops            91                512                 63
abstain in 3/3 hops            30                 20                156

Table 4: Alignment between model abstain decisions and incorrect answers regarding the number of hops in multi-hop QA. COMPETE shows a 73.4% match between abstain decisions and model failures.

A Analysis (cont.)

Retrieval Failure (cont.) We present more results with the two-step abstention in retrieval-augmented LLMs in Figure 10. Across three LLMs and four datasets, the abstain-retrieve-abstain pipeline successfully brings down the incorrect rate and accounts for retrieval failure.

Abstain and Multi-Hop (cont.) We present more results with the multi-hop abstention in Figure 11. It is demonstrated that our proposed COOPERATE and COMPETE are consistently better at pinpointing knowledge limitations in specific reasoning steps across LLMs and datasets. We further investigate the alignment between model abstain decisions and model answer incorrectness: for how many hops did the LLM abstain, and for how many hops did the LLM actually answer incorrectly? Table 4 indicates that model abstain decisions match model failures 73.4% of the time, showcasing the effectiveness of multi-LLM collaboration for abstention in multi-hop problems.

Held-Out Sets Harm Generalization In the study we find that most of the baselines would require a held-out set of questions for model tuning, hyperparameter optimization, and more. Specifically, while INSTRUCTION TUNING is one of the strongest baselines in Section 4, its reliance on a held-out set for training might jeopardize its gen-
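As a quick sanity check on Table 4 above, the reported 73.4% match rate is the diagonal of the 3x3 abstain-hops versus failed-hops matrix divided by the total number of cases; the counts below are copied from the table.

```python
# Reproduce the 73.4% match rate from the counts in Table 4.
table4 = [
    [351, 153, 12],   # abstain in 1/3 hops
    [91, 512, 63],    # abstain in 2/3 hops
    [30, 20, 156],    # abstain in 3/3 hops
]
diagonal = sum(table4[i][i] for i in range(3))
total = sum(sum(row) for row in table4)
print(f"match rate = {diagonal / total:.1%}")  # -> 73.4%
```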
Method R-Acc ER A-Acc A-F1
COOP-SELF-1 .540 .040 .544 .543
COOP-SELF-2 .564 .035 .498 .553
COOP-SELF-3 .562 .032 .504 .562
COMPETE-1 .672 .088 .590 .671
COMPETE-2 .698 .126 .623 .680
COMPETE-3 .655 .077 .577 .662
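The table above reports three sampled runs per collaboration-based method (sampling temperature 0.7); a small sketch of summarizing such runs as mean and standard deviation, with the numbers copied from the table and the helper code being ours:

```python
# Summarize run-to-run variation for the three sampled runs reported above.
from statistics import mean, stdev

runs = {
    "COOP-SELF": {"R-Acc": [.540, .564, .562], "ER": [.040, .035, .032]},
    "COMPETE":   {"R-Acc": [.672, .698, .655], "ER": [.088, .126, .077]},
}

for method, metrics in runs.items():
    for metric, values in metrics.items():
        print(f"{method} {metric}: {mean(values):.3f} +/- {stdev(values):.3f}")
```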
• Propaganda (Piskorski et al., 2023): we create a random split of 231 questions as the held-out set and 200 questions as the test set based on the official validation split. We randomly sample three other propaganda tactics accompanied by the correct tactic to form four-way multiple-choice questions.

• Hidden Layers: We employ the feature-extraction pipeline with HuggingFace (Wolf et al., 2019) to extract e_q. A linear layer of size (DIM, 2) is then employed for linear probing, where DIM is the dimension of e_q. The abstain likelihood is obtained as the l = reject likelihood after softmax (see the sketch below).
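A minimal sketch of the linear probe described in the Hidden Layers bullet above; the model name, mean pooling over tokens, and the training loop are assumptions not specified in this excerpt.

```python
# Sketch of the HIDDEN-LAYERS abstain probe: extract a question embedding e_q,
# apply a (DIM, 2) linear layer, and read the "reject" probability after softmax.
import torch
from transformers import pipeline

extractor = pipeline("feature-extraction", model="mistralai/Mistral-7B-v0.1")  # assumed model

def question_embedding(question: str) -> torch.Tensor:
    token_features = torch.tensor(extractor(question)[0])  # (num_tokens, DIM)
    return token_features.mean(dim=0)                      # mean-pool over tokens (assumption)

DIM = 4096                          # hidden size of Mistral-7B
probe = torch.nn.Linear(DIM, 2)     # linear probe; class index 1 = "reject"/abstain

def abstain_likelihood(question: str) -> float:
    logits = probe(question_embedding(question))
    return torch.softmax(logits, dim=-1)[1].item()  # reject probability
```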
Provide your best guess for the following question. Give ONLY the guess, no other words or explanation.
For example:
Guess: <most likely guess, as short as possible; not a complete sentence, just the guess!>
[LLM-generated answer]
Provide the probability that your guess is correct. Give ONLY the probability, no other words or explanation.
For example:
Probability: <the probability between 0.0 and 1.0 that your guess is correct, without any extra commentary whatsoever; just
the probability!>
Probability: [LLM-generated probability]
<question>
[LLM-generated answer]
The above answer is:
A. True
B. False
The answer is [LLM-generated A/B]
<question>
<options>
Does the proposed answer exist in the options?
[LLM-generated yes/no]
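The templates above are converted into abstain decisions by parsing the model output; for example, the verbalized-confidence prompt can be thresholded as in the sketch below. The regex, the llm helper, and the threshold value are assumptions; the paper tunes such thresholds on the held-out set.

```python
# Sketch: abstain when the model's verbalized probability falls below a threshold.
import re

def llm(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM."""
    raise NotImplementedError

def ask_calibrate_abstain(question: str, threshold: float = 0.5) -> bool:
    guess = llm("Provide your best guess for the following question. "
                f"Give ONLY the guess, no other words or explanation.\n{question}")
    prob_text = llm("Provide the probability that your guess is correct. "
                    "Give ONLY the probability, no other words or explanation.\n"
                    f"Question: {question}\nGuess: {guess}")
    match = re.search(r"(0(?:\.\d+)?|1(?:\.0+)?)", prob_text)
    confidence = float(match.group(1)) if match else 0.0  # unparseable -> abstain
    return confidence < threshold
```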
Figure 10: Model performance in the without-retrieval, with-retrieval, and retrieval+abstain settings with the COMPETE approach.
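Appendix A describes the two-step abstain-retrieve-abstain pipeline evaluated in Figure 10; a minimal sketch follows, where the retrieve, llm_answer, and should_abstain helpers are hypothetical stand-ins for the retriever, the QA prompt, and any abstain mechanism such as COMPETE.

```python
# Sketch of the abstain-retrieve-abstain pipeline: retrieve only when the first
# abstain check fires, and allow a second abstention if retrieval did not help.
from typing import Optional

def llm_answer(question: str, context: str = "") -> str:
    raise NotImplementedError  # hypothetical QA call

def retrieve(question: str) -> str:
    raise NotImplementedError  # hypothetical retriever

def should_abstain(question: str, answer: str) -> bool:
    raise NotImplementedError  # any abstain mechanism, e.g., COMPETE

def abstain_retrieve_abstain(question: str) -> Optional[str]:
    answer = llm_answer(question)
    if not should_abstain(question, answer):
        return answer                      # confident without retrieval
    context = retrieve(question)           # fetch external knowledge
    new_answer = llm_answer(question, context)
    if should_abstain(question, new_answer):
        return None                        # retrieval failed too: abstain
    return new_answer
```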
Figure 11: Model performance in the multi-hop knowledge reasoning setting.
Figure 12: Performance of the COMPETE mechanism with Mistral-7B across the 57 subtasks in MMLU.
Question: <question>
Answer: [generated proposed answer]
Question: <question>
Proposed Answer: <generated proposed answer>
// obtain proposed answer
Question: <question>
Answer: [generated proposed answer]
Question: <question>
Answer: <generated proposed answer>
Please review the proposed answer and provide feedback on its correctness.
Feedback: [generated feedback using llm]
Question: <question>
Proposed Answer: <generated proposed answer>
Question: <question>
Answer: [generated proposed answer]
Question: <question>
Generate a knowledge paragraph about <alternative answer>: [generated alternative passage]
// challenge!
Answer the question with the following knowledge: feel free to ignore irrelevant or wrong information.
// abstain decision
if <new generated answer> == <generated proposed answer>:
abstain = False
else:
abstain = True
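For completeness, a hedged runnable rendering of the feedback-and-judge (COOPERATE) path sketched in the templates above; the llm and expert_llms placeholders and the verdict parsing are assumptions, and the if/else block above already captures the COMPETE decision.

```python
# Sketch of COOPERATE: collect feedback on the proposed answer from several
# expert LLMs, then ask a judge LLM for a True/False verdict and abstain on False.
def llm(prompt: str) -> str:
    raise NotImplementedError  # judge LLM call (placeholder)

def cooperate_abstain(question: str, proposed_answer: str, expert_llms) -> bool:
    feedbacks = []
    for expert in expert_llms:  # each expert is a callable: prompt -> str
        feedbacks.append(expert(
            f"Question: {question}\nAnswer: {proposed_answer}\n"
            "Please review the proposed answer and provide feedback on its correctness."
        ))
    feedback_block = "\n".join(f"Feedback {i + 1}: {f}" for i, f in enumerate(feedbacks))
    verdict = llm(
        f"Question: {question}\nProposed Answer: {proposed_answer}\n{feedback_block}\n"
        "Based on the feedback, the proposed answer is:\nA. True\nB. False\nThe answer is"
    )
    # Abstain when the judge deems the proposed answer false.
    return verdict.strip().startswith("B") or "False" in verdict
```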
Question: The following information pertains to Dash Co.'s utility bills: Period covered Amount Date paid April 16 – May 15 $5000 June 1 May 16 – June 15 $6000 July 1 June 16 – July 15 $8000 August 1 What is the amount that Dash should report as a liability in its June 30 balance sheet?
A: $6,000
B: $7,000
C: $10,000
D: $14,000
Choose one answer from the above choices. The answer is B: $7,000
Feedback 1: The correct answer is actually A: $6,000. The amount that Dash should report as a liability in its June 30 balance
sheet is the utility bill for the period of May 16 - June 15, which is $6,000. The subsequent utility bills for June 16 - July 15
and July 16 - August 15 will be reported as liabilities in future balance sheets.
Feedback 2: The answer is incorrect. To determine the liability that Dash should report in its June 30 balance sheet, we need
to look at the amount that is unpaid as of June 30. According to the information given, the utility bill for April 16 - May 15
was paid on June 1. Therefore, Dash should report the unpaid amount for the period May 16 - June 15 as a liability on its
June 30 balance sheet. This amount is $6,000. Thus, the correct answer is A: $6,000.
Table 14: Example one for COOPERATE-self. The reviewers provided different feedback, and the area chair LLM merged them into a final abstain decision.
Question: Compared with the current in an LED lamp the current in the connecting wire is
A: less.
B: more.
C: the same.
D: Need more information.
Choose one answer from the above choices. The answer is A: less.
Feedback 1: The proposed answer, A: less, is incorrect. The answer C: the same, would be more appropriate. In an LED
lamp, the current flows from the power source through the connecting wire and into the LED. The current in the wire would
be the same as the current in the LED if there are no resistive elements in the wire causing a drop in voltage.
Feedback 2: The proposed answer is incorrect. The statement does not provide enough information to determine the current
in the connecting wire compared to the current in an LED lamp. Therefore, the correct answer is D: Need more information.
Feedback 3: The proposed answer is correct. In an LED lamp, the current in the connecting wire is typically less compared to
the current in the LED itself. This is because the LED lamp is designed to limit the amount of current flowing through the
LED to ensure proper operation and prevent damage. The connecting wire acts as a conductor to allow the flow of current
from the power source to the LED, but it does not have as high a current as the LED itself.
Table 15: Example two for COOPERATE-self. The reviewers provided different feedback, and the area chair LLM merged them into a final abstain decision.
Question: The following information pertains to Dash Co.'s utility bills: Period covered Amount Date paid April 16 – May 15 $5000 June 1 May 16 – June 15 $6000 July 1 June 16 – July 15 $8000 August 1 What is the amount that Dash should report as a liability in its June 30 balance sheet?
A: $6,000
B: $7,000
C: $10,000
D: $14,000
Choose one answer from the above choices. The answer is B: $7,000
Feedback 1: The answer provided is not correct. The correct amount that Dash should report as a liability in its June 30
balance sheet is $6,000. This is because the amount for the period of May 16 - June 15 has already been paid on June 1, and
does not need to be reported as a liability. The only amount that is still outstanding and needs to be reported is the amount for
the period of June 16 - July 15, which is $6,000. Therefore, the correct answer is A: $6,000.
Feedback 2: The proposed answer is incorrect. Dash should report $7,000 as a liability in its June 30 balance sheet, but not
$6,000.
1. The period covered by the utility bills is May 16 – June 15.
2. The amount due is $6,000, and Dash paid $5,000 on May 16 – May 15 and $1,000 on June 16 – June 15.
3. The date paid for the $6,000 bill is June 16 – June 15 and the date paid for the $1,000 bill is June 16 – June 15.
4. Since the payment of the $1,000 bill has already
The June 30 balance sheet should report a liability for the utility bills that have been incurred but not yet
Table 16: Example one for COOPERATE-others. The reviewers provided different feedback, and the area chair LLM merged them into a final abstain decision.
Question: Who kills Tony at the end of the film ’West Side Story’?
A: Riff
B: Chino
C: Bernardo
D: He kills himself
Choose one answer from the above choices. The answer is B: Chino
Feedback 1: The proposed answer is correct. Chino is the one who kills Tony at the end of the film ’West Side Story’.
Feedback 2: The answer provided is incorrect. In the film ’West Side Story’, it is Bernardo who kills Tony. Thank you for
providing the correct answer.
Feedback 3: The proposed answer is incorrect. Tony is killed by Bernardo, not Chino. The correct answer is C: Bernardo.
Table 17: Example two for COOPERATE-others. The reviewers provided different feedback, and the area chair LLM merged them into a final abstain decision.
Question: The following information pertains to Dash Co.'s utility bills: Period covered Amount Date paid April 16 – May 15 $5000 June 1 May 16 – June 15 $6000 July 1 June 16 – July 15 $8000 August 1 What is the amount that Dash should report as a liability in its June 30 balance sheet?
A: $6,000
B: $7,000
C: $10,000
D: $14,000
Generate a knowledge paragraph about A.
Dash Co. should report $6,000 as a liability in its June 30 balance sheet. This is because the utility bills for the period from
May 16 to June 15 amount to $6,000. The balance sheet is a financial statement that provides an overview of a company’s
financial position at a specific point in time. Liabilities are obligations or debts owed by a company, which are classified as
current liabilities if they are expected to be settled within one year. In this case, the utility bill for the period ending on June
15 falls within the one-year timeframe and should therefore be reported as a current liability. Therefore, the correct answer is
A: $6,000.
New answer: A: $6,000.
Table 18: Example one for COMPETE. The LLM is impacted by the alternative answer and should abstain.
Question: A flashing red traffic light signifies that a driver should do what?
A: stop
B: speed up
C: proceed with caution
D: honk the horn
Generate a knowledge paragraph about D.
D: Honking the horn is the appropriate action when encountering a flashing red traffic light. This is an important traffic rule
that drivers must follow to maintain safety on the roads. When encountering a flashing red light, it is crucial for all drivers to
be aware of traffic signals and adhere to their instructions to ensure a smooth flow of traffic and prevent any potential hazards.
Honking the horn, on the other hand, is generally used as a way to alert other drivers or pedestrians of potential dangers or to
communicate with them in certain situations.
New answer: A: stop.
Table 19: Example two for COMPETE. The LLM is not impacted by the alternative answer.
Kelsen’s pure theory of law aimed to provide a "science of law" by separating law from other disciplines such as religion,
ethics, sociology, and history. He argued that law should be treated as a self-contained system, with its own hierarchy of
norms and principles. This meant that the validity of laws should be determined based on their consistency with higher-level
norms, rather than any inherent moral or ethical qualities.
By focusing purely on the concept of justice, Kelsen attempted to provide an objective and neutral framework for understanding
and analyzing legal systems. He believed that by divorcing law from subjective moral judgments, his theory would provide a
more rigorous and scientific approach to the
New answer: D: Discussed law purely in terms of justice.
Table 20: Example three for COMPETE. The LLM is impacted by the alternative answer and should abstain.
Question: Suppose you live on the Moon. How long is a day (i.e. from sunrise to sunrise)?
A: about 18 years
B: 24 hours
C: 29 Earth days
D: a year
Generate a knowledge paragraph about A.
Option A: about 18 years
If you were to live on the Moon, a day would last about 18 years. This is due to a phenomenon called tidal locking. Tidal
locking occurs when the gravitational force between two objects, such as the Earth and the Moon, causes one object to always
face the other with the same side. In the case of the Moon, it is tidally locked to the Earth, meaning one side is constantly
facing our planet.
This tidal locking also means that the Moon’s rotation period matches its orbital period around the Earth, taking approximately
27.3 Earth days to complete one full revolution. However, despite this prolonged rotation period, it is still necessary to define
a day from a sunrise to a sunrise, just as we do on Earth.
Since the Moon does not have an atmosphere to scatter sunlight and bring about the distinct sunrise and sunset we are
accustomed to, the definition of a lunar day can be somewhat arbitrary. One way to measure it is based on
New answer: B: 24 hours.
Table 21: Example four for COMPETE. The LLM is not impacted by the alternative answer.