2410.15413v1
Abstract
We present a large-scale evaluation of 30
cognitive biases in 20 state-of-the-art large
arXiv:2410.15413v1 [cs.CL] 20 Oct 2024
2 Related Work

Cognitive Biases in LLMs Recently, LLMs’ presence in high-stakes decision-making has rapidly become ubiquitous (Wu et al., 2023; Singhal et al., 2023). In the pursuit of explainable and trustworthy models, it is imperative to extend the traditional scope of biases, e.g., gender and ethical ones (Gallegos et al., 2024), to account for biases and heuristics of cognition that directly impact the rationality of LLMs’ judgments (Hagendorff et al., 2023).

Earlier works in this direction (Talboy and Fuller, 2023; Macmillan-Scott and Musolesi, 2024) focused on detecting effects on the level of individual prompts. Separate research directions investigated challenges of cognitive bias detection and mitigation for lists of fewer than six cognitive biases (Tjuatja et al., 2024; Itzhak et al., 2024), particular LLM roles (Pilli, 2023; Koo et al., 2023; Ye et al., 2024), or specific domains (Schmidgall et al., 2024; Opedal et al., 2024).

With the aim of having a large-scale benchmark for cognitive biases in LLMs, follow-up works proposed a number of frameworks. Notably, a framework proposed by Echterhoff et al. (2024) encapsulates quantitative evaluation and automatic mitigation of cognitive biases; however, its variability is constrained to only five biases and a single scenario of student admissions, two limitations we directly address in this paper. The recent contribution of Xie et al. (2024) explores a similar direction through multi-agent systems. Their framework, similar to our approach, requires user-defined, bias-specific input and employs an LLM for the generation of the dataset; however, their construction additionally involves expert post-validation as the tests are entirely generated by the LLM. We propose a way to overcome this limitation while not compromising on the validity and diversity of the dataset (see Section 3).

The development of a scalable, systematic, and expandable benchmark would allow for further progress in the task of comprehensive mitigation of cognitive biases in LLMs (e.g., Wang et al., 2024a) and thus comprises the main motivation for this paper.

LLMs as Data Generators Labeling, assembling, or creating large amounts of data with desired properties have always been associated with high costs and significant labor. Moreover, this process is inherently intricate due to the annotator’s and the instructions’ biases (Parmar et al., 2023). Recent impressive performance by the state-of-the-art LLMs (e.g., Dubey et al., 2024, Achiam et al., 2023) has shifted the perspective on these tasks, calling LLMs to the rescue.

The surveys by Tan et al. (2024), Long et al. (2024) summarize the progress in this direction. Notably, Lee et al. (2023) showed the cost-effectiveness of LLM data creation and competitive performance of models trained on this data. Diversity of prompts is shown to directly impact the diversity of generated data (Yu et al., 2024), with works proposing self-generated instructions (Wang et al., 2022) and multi-step (He et al., 2023; Wang et al., 2024b) approaches to achieve the respective enhancement. We employ a similar strategy by introducing the logic of scenarios (see Section 3).

Earlier findings (Efrat and Levy, 2020) elicited flaws in LLMs’ instruction-following, and more recent work (He et al., 2024) still indicates the struggle with complex instructions. Moreover, synthetic datasets are found to exhibit biases (Yu et al., 2024). In our framework, we maintain a careful balance between preserving the validity of well-established cognitive bias testing paradigms and bringing in the variability of data generated by LLMs.

3 Test Framework

We introduce a novel framework for reliably generating diverse and large-scale sets of tests for evaluating LLMs. The main motivation for the creation of the framework was to efficiently scale tests that have a static abstract paradigm (that is based on corresponding research and has to be strictly followed) by generating diverse contexts around it.

The framework comprises four entities and three functions. Entities hold together certain pieces of information, while functions transform entities into other entities. All entities and functions are explained in the following. We use lower case letters t, s, c, r, b, ... to denote entities and their contents. Functions are denoted by upper case letters G, D, E. Some functions use an LLM internally. We use fθ (inside G) or hθ (inside D) to denote a pre-trained LLM with parameters θ.

Among the entities, only a few starting entities are human-created; all other entities are created by applying functions to the starting entities. Table 1 provides an example illustrating the main entities and Figure 2 shows the pipeline of functions through which entities flow.
2
Test Case: Anchoring Bias

CONTROL TEMPLATE

Situation:
Suppose you are [[a/an]] [[type]] manager at [[organization]]. You [[formulate a task of quantitative allocation of a single concrete resource for one single particular purpose. Do not include any numbers.]].
Prompt:
Which allocation level do you choose for this purpose?
Answer options:
Option 1: 0%
Option 2: 10%
...
Option 11: 100%

TREATMENT TEMPLATE

Situation:
Suppose you are [[a/an]] [[type]] manager at [[organization]]. You [[formulate a task of quantitative allocation of a single concrete resource for one single particular purpose. Do not include any numbers.]].
Prompt:
Do you intend to allocate more than {{anchor}}% for this purpose? Which allocation level do you choose for this purpose?
Answer options:
Option 1: 0%
Option 2: 10%
...
Option 11: 100%

Scenario: A marketing manager at a company from the telecommunication services industry deciding the best strategy to launch a new service package on social media platforms.

Insertions: [[a/an]]: "a", [[type]]: "marketing", [[organization]]: "telecommunications company", [[formulate a task of quantitative allocation of a single concrete resource for one single particular purpose. Do not include any numbers]]: "allocate a budget for promoting the new service package on social media platforms", {{anchor}}: "87".

Table 1: This table shows an example test case for measuring the Anchoring Bias in LLMs. It uses a control and a treatment template. Gaps are highlighted in [[blue]] if insertions are sampled from an LLM and in {{red}} if insertions are sampled from a custom values generator. The difference between both templates, the part that elicits the bias, is highlighted in yellow. The bottom part shows the insertions generated for the gaps by the test generator.
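As a minimal sketch of how the gaps in Table 1 could be filled, the following Python snippet substitutes insertions into `[[...]]` and `{{...}}` gaps. The function name `fill_template`, the shortened templates, and the use of plain dictionaries are assumptions made for illustration; they are not taken from the paper's implementation. In a real run, the `[[...]]` insertions would come from the generating LLM and the `{{...}}` insertions from a custom values generator.

```python
import re

# Gap syntax as shown in Table 1.
LLM_GAP = re.compile(r"\[\[(.+?)\]\]", re.DOTALL)     # filled by the generating LLM
CUSTOM_GAP = re.compile(r"\{\{(.+?)\}\}", re.DOTALL)  # filled by a custom values generator


def fill_template(template, llm_insertions, custom_insertions):
    """Replace every [[gap]] and {{gap}} in the template with its insertion."""
    filled = LLM_GAP.sub(lambda m: llm_insertions[m.group(1)], template)
    return CUSTOM_GAP.sub(lambda m: custom_insertions[m.group(1)], filled)


# Shortened versions of the two templates from Table 1 (illustrative only).
CONTROL = (
    "Suppose you are [[a/an]] [[type]] manager at [[organization]]. "
    "Which allocation level do you choose for this purpose?"
)
TREATMENT = (
    "Suppose you are [[a/an]] [[type]] manager at [[organization]]. "
    "Do you intend to allocate more than {{anchor}}% for this purpose? "
    "Which allocation level do you choose for this purpose?"
)

# Insertions copied from Table 1; the same dictionaries are reused for both
# templates, so gaps shared by the two templates receive identical values.
llm_insertions = {
    "a/an": "a",
    "type": "marketing",
    "organization": "telecommunications company",
}
custom_insertions = {"anchor": "87"}

control_instance = fill_template(CONTROL, llm_insertions, custom_insertions)
treatment_instance = fill_template(TREATMENT, llm_insertions, custom_insertions)
```

Reusing one insertion dictionary for both templates is one simple way to keep the control and treatment instances consistent, so that they differ only in the bias-eliciting part.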
3
source of diversity of the final tests.

Decision Result A decision result r_{c′,hθ} = [a1, a2] stores the answers of an LLM hθ to a test case c′. The answers a1 and a2 are provided to template instances t′1, t′2 ∈ c′, respectively. A valid answer chooses exactly one of the options defined in a template instance.

3.2 Functions

Generate A test generator G(fθ, c, s) takes an LLM fθ, a test case c, and a scenario s to sample a test case c′ ∼ G(fθ, c, s) by inserting values into the template gaps. These insertions can be either sampled from the LLM fθ or from the custom value generators {v1, v2, ...} ∈ c according to the template instructions p and scenario s. Which insertions are sampled from the LLM versus from the custom values generators is defined in the specific test generator, which is designed in close alignment with the corresponding templates.

In our framework implementation, the two template instances are sampled in two independent LLM calls t′1 ∼ fθ^GEN(t1, s) and t′2 ∼ fθ^GEN(t2, s), where GEN denotes the particular LLM prompt used for generation (see Appendix D). However, identical gaps that exist in both templates, i.e., g1 ∩ g2, g1 ∈ t1, g2 ∈ t2, will only be filled once for t1 and their insertions will then be copied over to t2 to ensure consistency between the template instances. The GEN prompt provides the LLM with the template as illustrated in Table 1 and instructs the LLM to suggest suitable insertions for the gaps resembling the scenario.

Decide The decide function D(hθ, c′) uses a potentially different LLM hθ to decide on answers a1 and a2 to the two templates t′1, t′2 ∈ c′, respectively. The answers are sampled in two independent LLM calls, a1 ∼ hθ^DEC(t′1) and a2 ∼ hθ^DEC(t′2), where DEC is the LLM prompt used for retrieving decisions (see Appendix D). We implement DEC as two prompts, where the first lets the LLM freely reason about the answer options before ultimately choosing one and the second instructs the LLM to extract only the chosen option from its previous response. Once both answers have been obtained from the LLM, they are returned in a decision result r_{c′,hθ} ∼ D(hθ, c′).

Estimate The estimate function E(c′, r_{c′,hθ}) = b estimates the score of the test case, a value b, using the metric m ∈ c′ on the answers a1, a2 ∈ r_{c′,hθ}. For simplicity, we suggest to define m such that b ∈ [−1, 1]. The exact metric used in our implementation is introduced in Section 4.5.

4 Framework Application to Cognitive Bias Tests for LLMs

The general-purpose framework described in Section 3 allows for conducting scaleable tests of various kinds (see Appendix A for examples). In this section, we introduce our specific application of the framework to measuring cognitive biases in LLMs.

4.1 Bias Selection

We aim to identify a subset of cognitive biases most relevant to managerial decision-making. As a starting point, we chose the Cognitive Bias Codex infographic (III and Benson, 2016), as also done by Atreides and Kelley (2023). The graphic lists and categorizes 188 cognitive biases. To identify the subset of these biases most relevant in managerial decision-making, we assessed the number of publications that mention the bias in a management context, as found through Google Scholar (assessment done on March 6, 2024). The exact search query we used is

    "{bias}" AND ("decision-making"
    OR "decision") AND
    (intitle:"management" OR
    intitle:"managerial")

We ranked all 188 cognitive biases by the number of identified search results and selected the 30 most frequently discussed biases. We removed three biases from the list where we found no testing procedure applicable to LLMs and two biases that appeared to be semantic duplicates of other biases we already included. We replaced them with the five biases following in the ranked list (see Table 5 in Appendix C for details).

Based on the available scientific literature, we designed a unique test case c and corresponding test generator G for each of the top 30 cognitive biases. We aimed to define the test case templates to reflect the minimum viable test design and included gaps for specifics about a scenario. An example test can be seen in Table 1. A detailed collection of scientific references and description of the exact test designs for all 30 biases can be found in Appendix B.
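The selection procedure of Section 4.1 can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code: the function names `build_query` and `select_biases` are hypothetical, and the hit counts in the usage example are made-up toy values rather than real Google Scholar results. Taking the first n non-excluded entries of the ranked list is equivalent to selecting the top n and replacing removed biases with the next ones in the ranking.

```python
def build_query(bias):
    """Build the search query from Section 4.1 for a single bias name."""
    return (
        f'"{bias}" AND ("decision-making" OR "decision") AND '
        '(intitle:"management" OR intitle:"managerial")'
    )


def select_biases(hit_counts, excluded, n=30):
    """Rank biases by number of search results and keep the first n
    that were not excluded (untestable or semantic duplicates)."""
    ranked = sorted(hit_counts, key=hit_counts.get, reverse=True)
    return [bias for bias in ranked if bias not in excluded][:n]


# Toy example with invented counts (NOT real search results).
toy_counts = {"Bias A": 500, "Bias B": 900, "Bias C": 100, "Bias D": 700}
toy_selection = select_biases(toy_counts, excluded={"Bias D"}, n=2)
example_query = build_query("Anchoring")
```

In this toy run, "Bias D" ranks second but is excluded, so the next bias in the ranking ("Bias A") takes its place.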
Figure 2: Our overall test pipeline comprises four steps: for each test case, it (1) takes a scenario and a test case
with two templates as input, (2) samples two instances of the templates by inserting suitable values into all template
gaps, (3) lets a decision LLM choose one option for each template instance, and (4) uses the corresponding metric
to estimate the final bias value.
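Step (3) of the pipeline, the Decide function of Section 3.2, could be sketched as follows. The `LLM` callable interface, the prompt wording, and the `stub_llm` stand-in are all assumptions for illustration; the actual DEC prompts are given in Appendix D and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a decision LLM is any function from prompt to text.
LLM = Callable[[str], str]


@dataclass
class DecisionResult:
    """Holds the answers [a1, a2] to the control and treatment instances."""
    answers: List[str]


def decide_one(llm: LLM, template_instance: str) -> str:
    """Two-prompt decision: free reasoning first, then option extraction."""
    reasoning = llm(
        "Reason about the answer options, then choose exactly one.\n\n"
        + template_instance
    )
    extracted = llm(
        "Extract only the chosen option from this response:\n\n" + reasoning
    )
    return extracted.strip()


def decide(llm: LLM, control_instance: str, treatment_instance: str) -> DecisionResult:
    """Query the decision LLM independently for each template instance."""
    return DecisionResult(
        [decide_one(llm, control_instance), decide_one(llm, treatment_instance)]
    )


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only for demonstration."""
    if prompt.startswith("Extract"):
        return " Option 6: 50% "
    return "After weighing the options, I choose Option 6: 50%."


demo_result = decide(stub_llm, "control instance text", "treatment instance text")
```

Separating reasoning from extraction keeps the final answer easy to parse while still letting the model deliberate freely in the first call.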
4.2 Scenario Generation

To increase the diversity of our tests, we generated a set of 200 unique management decision-making scenarios. A scenario includes a specific manager position, industry, and decision-making task, e.g.,

    “A clinical operations manager at a company from the pharmaceuticals, biotechnology & life sciences industry deciding on whether to proceed with Phase 3 trials after reviewing initial Phase 2 results.”

We generated these scenarios in three steps. Firstly, we extracted the 25 industry groups defined in the Global Industry Classification Standard (GICS) industry taxonomy (MSCI and Global, 2023). Secondly, we prompted a GPT-4o LLM with temperature=1.0 to return 8 commonly found manager positions per industry group. Thirdly, we prompted the LLM a second time to generate a suitable decision-making situation for each manager position in an industry group.

We combined industry groups, manager positions, and decision-making situations into 200 scenario strings and manually reviewed all of them. We identified three industry groups with at least one implausible scenario and regenerated their scenario strings using a different seed.

4.3 Dataset Generation

Our full dataset is generated by sampling 5 test cases for each of the 200 scenarios and 30 cognitive biases, resulting in 30,000 test cases in total. While the 200 scenarios serve as the main source of diversity in the dataset, the 5 test cases sampled per bias-scenario combination allow us to add important additional perturbations (we refer to Song et al. (2024) for why this is important) by inserting different custom values into the test cases for those test cases that rely on them.

We used a GPT-4o LLM with temperature=0.7 to sample values for the template gaps as it was among the most capable LLMs available at the time and appeared to provide reliable populations.

4.4 Dataset Validation

We performed validation of the generated dataset from two perspectives: correctness, i.e., how well the gap insertions in test cases are aligned with their corresponding instructions p_i, and diversity, i.e., how dissimilar the test cases c′ are to each other.

Correctness This stage comprises two procedures. Firstly, we randomly selected 300 samples from our dataset, 10 samples per each of the 30 biases, and performed manual verification. In total, we identified 3 test cases with flaws that could potentially impact the test logic; of these, 2 tests fall into the scope of the validation procedure on the next step.

Secondly, we used the IFEval framework (Zhou et al., 2023) to evaluate the instruction-following performance w.r.t. verifiable instructions (e.g., “Do not include any numbers.”). Test cases of 7 biases include instructions p_i that contain constraints crucial for the cognitive biases’ testing designs, and IFEval thus allows us to fully validate the insertions of the respective gaps x_i that the correctness of the corresponding tests is most dependent on. Among these 7 biases with verifiable instructions, 4 biases were generated 100% correctly, the other 3 biases’ populations have accuracies of 96.7%, 98.4%, and 99.6%. The details of the verification and an additional check on toxicity are provided in Appendix F.

LLM-based validation is an active and promising area of research (Chiang and Lee, 2023); however, we consciously did not use LLM-as-a-judge for assessing the correctness of the dataset due to current inconsistencies and biases in these approaches (Stureborg et al., 2024; Chen et al., 2024).

Figure 3: Cumulative distribution of cosine similarity scores for the datasets. (Axes: Cosine Similarity Score c, Cumulative Distribution. Legend: Ours, c_avg = 0.54; Echterhoff et al. (2024), c_avg = 0.65; Tjuatja et al. (2024), c_avg = 0.58.)
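To illustrate what checking a verifiable instruction such as “Do not include any numbers.” can look like, here is a small Python sketch. It is a simplified stand-in and does not use or reproduce the IFEval API; the function names are hypothetical, and the check only detects digit characters, so spelled-out numbers ("twenty") would pass unnoticed.

```python
import re


def contains_no_numbers(insertion: str) -> bool:
    """Return True if the insertion contains no digit characters."""
    return re.search(r"\d", insertion) is None


def instruction_accuracy(insertions) -> float:
    """Share of insertions that satisfy the 'no numbers' constraint."""
    checks = [contains_no_numbers(text) for text in insertions]
    return sum(checks) / len(checks)


# The first insertion is taken from Table 1; the second is an invented
# counterexample that violates the instruction.
sample_insertions = [
    "allocate a budget for promoting the new service package on social media platforms",
    "allocate 20% of the budget to social media promotion",
]
sample_accuracy = instruction_accuracy(sample_insertions)
```

Per-bias accuracies like those reported above would then be the result of applying such a check to all insertions of the relevant gap.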
[Figure 4 panels: Metric codomain, k = 1.]

Figure 5: The plot shows the absolute biasedness of models (Mean Absolute Bias) in relation to their size (bubble diameter) and Chatbot Arena score (as a measure of general capability). When no such score was available, we take the mean of the other models’ scores and mark the model with a ’*’.

Figure 6: The dendrogram shows how LLMs would be clustered based on their mean biasedness (based on complete linkage with a Euclidean distance metric).
where we denoted Δ_{a_i,y_i} = a_i − y_i, i = 1, 2. To account for variations in the test cases, we use additional parameters y1, y2 ∈ σ that allow us to trace relative shifts in the decisions. Similarly, parameter k = ±1 accounts for variations in the order of options in the templates t′.

In its most commonly used form across our tests, the metric m is simplified to:

$$m(a_{1,2}, k) = \frac{k \cdot (a_1 - a_2)}{\max[a_1, a_2]}. \qquad (2)$$

A visual intuition for the codomain of the metric is presented in Figure 4.

4.6 Selection of LLMs

We hypothesize that the susceptibility of LLMs to cognitive biases may be influenced by factors such as model size, architecture, and training procedure. Therefore, we decide to evaluate a broad selection of 20 state-of-the-art LLMs from 8 different developers and of vastly different sizes. A list of all evaluated models with further details is included in Appendix E. As a baseline, we also add a Random model that chooses answer options at random. We evaluate all LLMs with temperature=0.0. To account for LLMs’ well-observed bias w.r.t. the order of options (Zheng et al., 2023), we reverse the order of options in a randomly selected 50% of tests.

5 Results & Discussion

A perspective on the absolute biasedness of the models in relation to other model characteristics such as size and general capability is provided in Figure 5. As a proxy for a model’s general capability, we show each model’s Chatbot Arena score (scores from October 14, 2024) on the horizontal axis. While there seems to be no clear general correlation between a model’s size or capability and its biasedness, there is a noticeable discrepancy in absolute biasedness of the models. The tested Gemini LLMs seem to be the least biased while still highly capable models. Qwen2.5 72B, GPT-4o mini, and Mistral Large follow closely. The larger OpenAI models seem to be somewhat more biased, and Llama models of different sizes seem to score vastly differently in terms of general capability and biasedness, with none striking a competitive combination of both.

Figure 6 highlights clusters of models that exhibit similar biases. Some models that come from the same model families (e.g., Gemma, WizardLM) and some models of comparable size (e.g., Llama 3.2 1B and 3B) show similar bias characteristics. Further, four of the largest models tested can be found in the bottom four branches of the dendrogram, apparently showing similar behaviors.

The mean bias scores of all 20 models on all 30 cognitive biases are visualized in Figure 7. All models show significant biasedness on at least some of the tested cognitive biases. The vast majority of biases is positive, confirming that most cognitive biases present in humans can also be measured in LLMs. Only two of the 30 tested biases, Status-Quo Bias and Disposition Effect, were measured with strong negative direction, on average. On both biases, negative scores express a model’s pref-
WizardLM-2 8x22B
Gemini 1.5 Flash
Llama 3.1 405B
Claude 3 Haiku
WizardLM-2 7B
Gemini 1.5 Pro
Llama 3.1 70B
Gemma 2 27B
GPT-3.5 Turbo
Qwen2.5 72B
Llama 3.1 8B
Llama 3.2 3B
Llama 3.2 1B
Gemma 2 9B
Mistral Large
Mistral Small
GPT-4o mini
Random
Average
Yi-Large
GPT-4o
Phi-3.5
Information Bias 0.65 0.68 0.70 0.70 0.70 0.56 0.29 0.39 0.66 0.54 0.56 0.47 0.52 0.48 0.63 0.51 0.55 0.64 0.56 0.58 -0.01 0.54
In-Group Bias 0.00 0.63 0.55 0.44 0.23 0.85 0.81 0.51 0.52 0.33 0.51 0.52 0.07 0.69 0.04 0.59 0.48 0.84 0.02 0.00 0.00 0.41
Survivorship Bias 0.79 0.30 -0.01 0.82 0.73 0.39 0.07 0.12 0.39 0.72 0.52 0.72 0.34 0.72 0.64 0.06 0.16 0.00 0.48 0.64 -0.01 0.41
Framing Effect 0.48 0.43 0.40 0.55 0.46 0.53 0.39 0.10 0.44 0.38 0.47 0.35 0.44 0.51 0.47 0.49 0.29 0.49 0.37 0.53 0.01 0.41
Anchoring 0.60 0.40 0.35 0.40 0.64 0.46 0.48 0.15 0.67 0.41 0.29 0.33 0.36 0.37 0.40 0.37 -0.05 0.43 0.63 0.49 0.00 0.39
Halo Effect 0.33 0.37 0.46 0.40 0.39 0.39 0.39 0.14 0.38 0.20 0.33 0.15 0.27 0.39 0.31 0.53 0.34 0.52 0.37 0.42 -0.02 0.34
Loss Aversion 0.62 0.64 0.06 0.27 0.41 0.29 0.27 0.02 -0.00 0.01 0.40 0.52 0.46 0.41 0.32 0.73 0.44 0.20 0.69 0.25 -0.01 0.33
Hindsight Bias 0.34 0.44 0.48 0.32 0.23 0.36 0.34 0.12 0.47 0.17 0.53 0.45 0.22 0.29 0.47 0.23 0.21 0.48 0.38 0.37 0.01 0.33
Bandwagon Effect 0.66 0.32 0.80 0.34 0.08 0.12 0.60 0.04 0.11 0.71 0.19 0.37 0.07 0.01 0.54 0.12 0.10 0.56 0.53 0.56 0.00 0.33
Hyperbolic Discounting 0.22 0.03 0.39 0.38 0.41 0.11 0.14 0.00 0.25 0.15 0.29 0.18 0.35 0.23 0.22 0.02 0.26 0.80 0.42 0.12 -0.00 0.24
Conservatism 0.33 0.23 0.07 0.22 0.26 0.28 0.30 -0.22 0.25 0.19 0.08 0.19 0.27 0.32 0.26 0.20 0.24 0.40 0.21 0.42 0.01 0.21
Self-Serving Bias 0.59 0.08 0.05 0.03 0.21 0.11 0.19 -0.01 0.43 0.02 0.13 0.17 -0.02 0.53 0.35 0.34 0.07 0.79 0.09 0.36 -0.04 0.21
Confirmation Bias 0.00 0.04 0.09 0.03 0.69 0.02 -0.18 0.04 0.13 0.09 0.00 0.69 0.72 0.06 0.07 0.34 -0.06 0.02 0.30 0.00 0.01 0.15
Illusion of Control 0.15 0.09 0.04 0.24 0.23 0.17 -0.09 0.12 0.12 0.21 0.19 0.21 0.19 0.14 0.19 0.14 0.10 0.10 0.23 0.17 -0.01 0.14
Mental Accounting 0.74 0.10 -0.04 0.01 0.62 -0.01 0.02 -0.38 0.34 0.04 0.05 0.56 0.05 0.14 0.13 0.08 -0.12 -0.03 0.11 0.36 0.01 0.13
Negativity Bias 0.04 -0.03 0.36 0.03 0.14 0.47 0.09 -0.48 0.02 0.30 0.20 0.24 0.43 -0.12 0.03 0.06 0.05 0.51 0.01 0.10 0.01 0.12
Availability Heuristic 0.13 0.16 0.16 0.14 0.25 0.11 0.02 0.07 0.26 0.04 0.10 0.11 0.09 0.21 0.12 -0.10 0.11 0.02 0.15 0.13 -0.05 0.11
Fundamental Attribution Error 0.18 0.10 0.00 0.18 0.15 0.00 -0.02 0.05 0.05 0.10 0.15 0.12 -0.01 0.17 0.19 0.23 0.10 0.04 0.11 0.11 0.03 0.10
Stereotyping 0.06 0.06 0.18 0.14 0.20 0.33 -0.02 -0.00 0.19 0.05 0.07 0.05 -0.00 0.26 0.10 0.19 -0.05 0.03 0.02 0.01 0.02 0.09
Not Invented Here 0.05 0.08 0.08 0.09 0.12 0.07 0.01 0.08 0.05 0.07 0.12 0.10 0.16 0.13 0.04 0.10 0.13 0.01 0.07 0.01 0.02 0.08
Escalation of Commitment 0.12 0.17 0.12 0.09 0.15 0.03 -0.00 0.02 0.10 0.07 0.11 0.11 0.07 0.08 0.15 0.05 0.00 0.03 0.04 0.02 0.01 0.07
Risk Compensation 0.29 0.15 0.11 0.14 0.15 0.05 -0.00 -0.03 0.11 0.03 0.03 0.08 0.01 -0.10 0.09 0.09 0.12 0.02 0.03 0.04 -0.01 0.07
Social Desirability Bias 0.09 0.02 0.05 0.18 0.19 -0.01 0.05 -0.02 0.04 0.07 0.04 0.02 0.07 0.14 0.10 0.05 -0.01 0.07 0.09 0.12 0.03 0.07
Optimism Bias 0.14 0.03 0.03 0.09 0.02 0.04 0.00 0.04 0.05 0.08 0.05 0.11 0.07 0.06 0.07 0.04 0.03 0.04 0.06 0.07 0.04 0.06
Reactance -0.06 -0.07 0.03 -0.02 -0.04 -0.08 -0.09 0.01 -0.09 -0.06 0.03 -0.05 -0.03 0.01 -0.05 -0.03 0.00 0.29 -0.07 -0.03 0.02 -0.02
Planning Fallacy -0.16 -0.06 -0.04 -0.02 -0.01 0.03 0.19 0.20 -0.07 -0.14 -0.09 0.01 -0.17 0.05 -0.11 -0.02 0.11 -0.01 -0.08 -0.06 -0.05 -0.02
Endowment Effect 0.06 -0.25 -0.01 -0.00 -0.21 0.04 -0.29 -0.00 -0.05 0.11 -0.06 -0.06 -0.01 -0.13 -0.16 0.14 -0.22 0.16 -0.17 0.03 -0.01 -0.05
Anthropomorphism -0.03 -0.12 -0.09 -0.14 -0.14 -0.03 -0.07 -0.02 -0.03 -0.05 -0.08 -0.06 -0.09 -0.01 -0.08 -0.08 -0.01 -0.27 -0.01 0.01 0.01 -0.07
Status-Quo Bias -0.43 -0.31 -0.69 -0.55 -0.53 -0.61 0.24 -0.31 -0.61 -0.59 -0.60 -0.48 -0.45 -0.60 -0.57 -0.39 -0.59 0.05 -0.26 -0.57 -0.03 -0.42
Disposition Effect -0.84 -0.83 -0.40 -0.81 -0.84 -0.58 -0.35 -0.10 -0.75 -0.81 -0.93 -0.82 -0.95 -0.82 -0.78 -0.81 -0.67 -0.84 -0.82 -0.80 -0.01 -0.69
Average 0.20 0.13 0.14 0.16 0.20 0.15 0.13 0.02 0.15 0.11 0.12 0.18 0.12 0.15 0.14 0.14 0.07 0.21 0.15 0.15 -0.00 0.13
Average Absolute 0.37 0.32 0.35 0.37 0.41 0.35 0.37 0.32 0.35 0.32 0.33 0.36 0.32 0.39 0.32 0.36 0.37 0.40 0.32 0.34 0.54 0.36
Figure 7: The heatmap shows the average bias scores for all evaluated models and biases.
erence for change. The Random model shows no biasedness on average, highlighting our metric’s strength as an unbiased estimator. One LLM demonstrating surprisingly low average biasedness is the smallest Llama model (1B parameters). For this model, we registered the highest decision failure rate (the model could not decide for an option in 33% of test cases), suggesting that this LLM’s general behavior may not be strongly grounded in good reasoning.

6 Conclusion

We have presented a comprehensive evaluation of 30 cognitive biases in 20 state-of-the-art LLMs. This contribution broadens the current understanding of cognitive biases in LLMs through a systematic and large-scale assessment under various decision-making scenarios. We confirm early evidence from previous work suggesting that LLMs have cognitive biases and find that a majority of cognitive biases known in humans is also present in most LLMs. Human decision-makers considering employing LLMs to enhance the quality of their decisions should be careful to select suitable models not only based on their reasoning capabilities but also based on their proneness to biases and should generally weigh their interest in faster and better decisions against the ethical implications.

In this work, we further demonstrated how our general-purpose test framework can be applied to generating tests for LLMs at a large scale and with high reliability. We publish our dataset of cognitive bias tests to guide developers of future LLMs in creating less biased and more reliable models.
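As a rough illustration of the simplified metric in Equation (2) and of the Random baseline discussed in Section 5, the following Python sketch computes the score for a pair of answers and simulates a model that picks options uniformly at random. Several details are assumptions rather than facts from the paper: the function name `bias_metric`, the treatment of the case where both answers are 0 (returned as 0.0 to avoid division by zero), the use of the 11 percentage options from Table 1 as answer values, and the simulation setup itself.

```python
import random


def bias_metric(a1: float, a2: float, k: int = 1) -> float:
    """Simplified metric of Equation (2): k * (a1 - a2) / max(a1, a2).

    a1 and a2 are the answers to the control and treatment instances;
    k is +1 or -1 depending on the order of the answer options.
    """
    largest = max(a1, a2)
    if largest == 0:
        # Assumption: identical zero answers are scored as no shift.
        return 0.0
    return k * (a1 - a2) / largest


# Simulated Random model: both answers and the option order are random.
rng = random.Random(0)
options = list(range(0, 101, 10))  # 0%, 10%, ..., 100% as in Table 1
scores = [
    bias_metric(rng.choice(options), rng.choice(options), rng.choice([1, -1]))
    for _ in range(20000)
]
mean_random = sum(scores) / len(scores)
```

For non-negative answers the score stays within [−1, 1], and because the random answers are symmetric in a1 and a2, the simulated mean stays close to zero, which is consistent with the near-zero Random averages reported in Figure 7.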
7 Limitations

Our paper provides a systematic framework for defining and conducting cognitive bias tests with LLMs. While we have demonstrated our pipeline using management decision-making as an example and established a respective dataset with 30,000 test cases for cognitive biases, our framework is theoretically generalizable beyond just this domain and task. We provide some illustrative examples of applying our framework to other domains and test kinds in Appendix A but rely on future work to assess the framework’s versatility at scale. Our framework balances LLM generation and its benefit of cost-effectiveness with human control through templates with generalized instructions, which are similarly beneficial for other decision-making domains and use cases.

While over 180 cognitive biases are known in humans (III and Benson, 2016), our current dataset provides test cases for 30 of these biases. Our selection procedure utilized mentions in publications as an indicator for the relevance of biases in the chosen domain of managerial decision-making. As this may not be a perfectly reliable indicator for relevance and there are still over 150 cognitive biases not covered in our dataset, we invite other researchers to design tests for additional biases and domains.

Our test cases were generated with only one model, a GPT-4o LLM, chosen for its capabilities at the time of development. We also evaluate the same LLM on the dataset, which may give it an unfair advantage. We assume this influence to be low due to the detailed instructions in the templates giving the generating LLM clear restrictions on what to generate and how. Looking ahead, we anticipate that the majority of LLMs will soon possess the capability of generating test cases reliably. This development paves the way for a more widespread and effective application of our framework in the future.

In our evaluation, biasedness was calculated using discrete decisions made by the LLMs. Future work can also take into account token probabilities for an even more nuanced measurement and comparison of cognitive biases in LLMs.

8 Ethical Considerations

Our cognitive bias dataset of 30,000 test cases is one of the significant contributions of this paper. With this dataset, we also provide test cases for biases related to social attributes, e.g., Social Desirability Bias and Stereotyping. The stereotypes in our dataset are generated by a GPT-4o LLM and are often mildly negative or can sometimes be considered neutral (for a detailed toxicity analysis, see Figure 9 in Appendix F). Therefore, more harmful stereotypes are not propagated but can also not be assessed with our dataset. Manually curated benchmarks must also be consulted to understand and mitigate stereotypes against social groups and cultures.

Although we present a large dataset on cognitive biases that allows for a comprehensive evaluation, it is important to understand that no benchmark can eliminate the need to evaluate an LLM for a specific use case to understand the risks. While our work can be used to factor in cognitive biases in LLM selection, it should by no means serve as a free pass for using LLMs for purely machine-based decision-making. Also, we ask anyone working with our dataset not to use it to train current or future models but apply it for evaluative purposes only.

Use of AI Assistants We used AI assistant tools to support us in creating the code for our framework. We did not use AI assistants for writing any sections of this paper.

Total Computational Budget Throughout this research project, we spent a total of USD 793.55 on various APIs to run inference with the evaluated LLMs. An overview of the APIs used can be found in Table 6 in Appendix E.

References

Klaus Abbink and Donna Harris. 2019. In-group favouritism and out-group discrimination in naturally occurring groups. PloS one, 14(9):e0221616.

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.

Howard Abikoff, Mary Courtney, William E Pelham, and Harold S Koplewicz. 1993. Teachers’ ratings of disruptive behaviors: The influence of halo effects. Journal of abnormal child psychology, 21:519–533.

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman,
Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

George Ainslie and Nicholas Haslam. 1992. Hyperbolic discounting. In George Loewenstein and Jon Elster, editors, Choice over time, pages 57–92. Russell Sage Foundation, New York.

Bill Albert and Tom Tullis. 2013. Measuring the user experience: collecting, analyzing, and presenting usability metrics. Newnes.

Anthropic. 2024. The Claude 3 model family: Opus, Sonnet, Haiku.

David Antons and Frank T Piller. 2015. Opening the black box of “not invented here”: Attitudes, decision biases, and behavioral consequences. Academy of Management perspectives, 29(2):193–217.

Linda Argote, Bill McEvily, and Ray Reagans. 2003. Managing knowledge in organizations: An integrative framework and review of emerging themes. Management science, 49(4):571–582.

Kyrtin Atreides and David J Kelley. 2023. Cognitive biases in natural language: Automatically detecting, differentiating, and measuring bias in text. Differentiating, and Measuring Bias in Text.

Markus Baer and Graham Brown. 2012. Blind in one eye: How psychological ownership of ideas affects the types of suggestions people adopt. Organizational behavior and human decision processes, 118(1):60–71.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Ray Ball and Ross Watts. 1979. Some additional evidence on survival biases. The Journal of Finance, 34(1):197–206.

Jonathan Baron, Jane Beattie, and John C Hershey. 1988. Heuristics and biases in diagnostic reasoning: II. congruence, information, and certainty. Organizational behavior and human decision processes, 42(1):88–110.

Thomas Bayes. 1763. Lii. an essay towards solving a problem in the doctrine of chances. by the late rev. mr. bayes, frs communicated by mr. price, in a letter to john canton, amfr s. Philosophical transactions of the Royal Society of London, 53:370–418.

S.M. Beebe and R.H. Pherson. 2011. Cases in Intelli-

Nicole Bergen and Ronald Labonté. 2020. “everything is perfect, and we have no problems”: detecting and limiting social desirability bias in qualitative research. Qualitative health research, 30(5):783–792.

Bruno Biais and Martin Weber. 2009. Hindsight bias, risk perception, and investment performance. Management Science, 55(6):1018–1029.

Sunali Bindra, Deepika Sharma, Nakul Parameswar, Sanjay Dhir, and Justin Paul. 2022. Bandwagon effect revisited: A systematic review to develop future research agenda. Journal of Business Research, 143:305–317.

Bruce Blaine and Jennifer Crocker. 1993. Self-esteem and self-serving biases in reactions to positive and negative events: An integrative review. Self-esteem: The puzzle of low self-regard, pages 55–85.

Donald E Bowen III, S McKay Price, Luke CD Stein, and Ke Yang. 2024. Measuring and mitigating racial bias in large language model mortgage underwriting. Available at SSRN 4812158.

Anat Bracha and Donald J. Brown. 2012. Affective decision making: A theory of optimism bias. Games and Economic Behavior, 75(1):67–80.

Gifford W Bradley. 1978. Self-serving biases in the attribution process: A reexamination of the fact or fiction question. Journal of personality and social psychology, 36(1):56.

Jack W Brehm. 1966. A theory of psychological reactance. Academic press.

Sharon S Brehm and Jack W Brehm. 2013. Psychological reactance: A theory of freedom and control. Academic Press.

Stephen J Brown, William Goetzmann, Roger G Ibbotson, and Stephen A Ross. 1992. Survivorship bias in performance studies. The Review of Financial Studies, 5(4):553–580.

Roger Buehler, Dale Griffin, and Johanna Peetz. 2010. The planning fallacy: Cognitive, motivational, and social origins. In Advances in experimental social psychology, volume 43, pages 1–62. Elsevier.

Roger Buehler, Dale Griffin, and Michael Ross. 1994. Exploring the “planning fallacy”: Why people underestimate their task completion times. Journal of personality and social psychology, 67(3):366.

Christopher DB Burt and Simon Kemp. 1994. Construction of activity duration and time management potential. Applied Cognitive Psychology, 8(2):155–
gence Analysis: Structured Analytic Techniques in 168.
Action. SAGE Publications.
Sean D. Campbell and Steven A. Sharpe. 2009. An-
Uri Benzion, Amnon Rapoport, and Joseph Yagil. 1989. choring bias in consensus forecasts and its effect on
Discount rates inferred from decisions: An experi- market prices. Journal of Financial and Quantitative
mental study. Management science, 35(3):270–284. Analysis, 44(2):369–390.
W Keith Campbell and Constantine Sedikides. 1999. Self-threat magnifies the self-serving bias: A meta-analytic integration. Review of general Psychology, 3(1):23–43.
Mike Cardwell. 1999. Dictionary of Psychology. Fitzroy Dearborn, Chicago.
John S Carroll. 1978. The effect of imagining an event on expectations for the event: An interpretation in terms of the availability heuristic. Journal of experimental social psychology, 14(1):88–96.
Jihwan Chae, Kunil Kim, Yuri Kim, Gahyun Lim, Daeeun Kim, and Hackjin Kim. 2022. Ingroup favoritism overrides fairness when resources are limited. Scientific reports, 12(1):4560.
Iain Chalmers and Robert Matthews. 2006. What are the implications of optimism bias in clinical research? The Lancet, 367(9509):449–450.
Thierry Chaminade, Jessica Hodgins, and Mitsuo Kawato. 2007. Anthropomorphism influences perception of computer-animated characters’ actions. Social cognitive and affective neuroscience, 2(3):206–216.
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669.
Cheng-Han Chiang and Hung-yi Lee. 2023. A closer look into using large language models for automatic evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8928–8942, Singapore. Association for Computational Linguistics.
Jay J.J Christensen-Szalanski and Cynthia Fobian Willham. 1991. The hindsight bias: A meta-analysis. Organizational Behavior and Human Decision Processes, 48(1):147–168.
John Chung, Ece Kamar, and Saleema Amershi. 2023. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 575–593, Toronto, Canada. Association for Computational Linguistics.
Maia B. Cook and Harvey S. Smallman. 2008. Human factors of the confirmation bias in intelligence analysis: Decision support from graphical evidence landscapes. Human Factors, 50(5):745–754. PMID: 19110834.
W. Coombs and Sherry Holladay. 2006. Unpacking the halo effect: Reputation and crisis management. Journal of Communication Management, 10:123–137.
William H Cooper. 1981. Ubiquitous halo. Psychological bulletin, 90(2):218.
Douglas P Crowne and David Marlowe. 1960. A new scale of social desirability independent of psychopathology. Journal of consulting psychology, 24(4):349.
Mike Dacey. 2017. Anthropomorphism as cognitive bias. Philosophy of Science, 84(5):1152–1164.
David M. DeJoy. 1989. The optimism bias and traffic accident risk perception. Accident Analysis & Prevention, 21(4):333–340.
James Dillard and Lijiang Shen. 2005. On the nature of reactance and its role in persuasive health communication. Communication Monographs, 72:144–168.
James N Druckman. 2001. The implications of framing effects for citizen competence. Political behavior, 23:225–256.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. 2024. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36.
Jessica Echterhoff, Yao Liu, Abeer Alessa, Julian McAuley, and Zexue He. 2024. Cognitive bias in high-stakes decision-making with llms. arXiv preprint arXiv:2403.00811.
Allen L Edwards. 1953. The relationship between the judged desirability of a trait and the probability that the trait will be endorsed. Journal of applied Psychology, 37(2):90.
Allen L Edwards. 1957. The social desirability variable in personality assessment and research. Dryden Press.
Ward Edwards. 1982. Conservatism in human information processing (excerpted). In Daniel Kahneman, Paul Slovic, and Amos Tversky, editors, Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press, New York. Original work published 1968.
Avia Efrat and Omer Levy. 2020. The turking test: Can language models understand instructions? arXiv preprint arXiv:2010.11982.
Eva Eigner and Thorsten Händler. 2024. Determinants of llm-assisted decision-making. arXiv preprint arXiv:2402.17385.
Dirk M Elston. 2021. Survivorship bias. Journal of the American Academy of Dermatology.
Nicholas Epley, Adam Waytz, and John T Cacioppo. 2007. On seeing human: a three-factor theory of anthropomorphism. Psychological review, 114(4):864.
Eyal Ert and Ido Erev. 2013. On the descriptive value of loss aversion in decisions under risk: Six clarifications. Judgment and Decision Making, 8(3):214–235.
Jim AC Everett, Nadira S Faber, and Molly Crockett. 2015. Preferences and beliefs in ingroup favoritism. Frontiers in behavioral neuroscience, 9:126656.
Chris Fife-Schaw and Julie Barnett. 2004. Measuring optimistic bias. Doing social psychology research, pages 54–74.
Peter Fischer, Stephen Lea, Andreas Kastenmüller, Tobias Greitemeyer, Julia Fischer, and Dieter Frey. 2011. The process of selective exposure: Why confirmatory information search weakens over time. Organizational Behavior and Human Decision Processes, 114(1):37–48.
Cassandra Flick and Kimberly Schweitzer. 2021. Influence of the fundamental attribution error on perceptions of blame and negligence. Experimental Psychology, 68:175–188.
Valerie S. Folkes. 1988. The availability heuristic and perceived risk. Journal of Consumer Research, 15(1):13–23.
Robert Forsythe, Joel L Horowitz, Nathan E Savin, and Martin Sefton. 1994. Fairness in simple bargaining experiments. Games and Economic behavior, 6(3):347–369.
Feng Fu, Corina E Tarnita, Nicholas A Christakis, Long Wang, David G Rand, and Martin A Nowak. 2012. Evolution of in-group favoritism. Scientific reports, 2(1):460.
Javier Fuenzalida, Gregg G. Van Ryzin, and Asmus Leth Olsen. 2021. Are managers susceptible to framing effects? an experimental study of professional judgment of performance metrics. International Public Management Journal, 24(3):314–329.
Adrian Furnham and Hua Chu Boo. 2011. A literature review of the anchoring effect. The Journal of Socio-Economics, 40(1):35–42.
Adele Gabrielcik and Russell H Fazio. 1984. Priming and frequency estimation: A strict test of the availability heuristic. Personality and Social Psychology Bulletin, 10(1):85–89.
Kristel M. Gallagher and John A. Updegraff. 2011. Health Message Framing Effects on Attitudes, Intentions, and Behavior: A Meta-analytic Review. Annals of Behavioral Medicine, 43(1):101–116.
Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational Linguistics, pages 1–79.
Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
Rebecca L Guilbault, Fred B Bryant, Jennifer Howard Brockway, and Emil J Posavac. 2004. A meta-analysis of research on hindsight bias. Basic and applied social psychology, 26(2-3):103–117.
Thilo Hagendorff, Sarah Fabi, and Michal Kosinski. 2023. Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in chatgpt. Nature Computational Science, 3(10):833–838.
Laura Hanu and Unitary team. 2020. Detoxify. Github. https://github.com/unitaryai/detoxify.
Francesca GE Happé. 1994. An advanced test of theory of mind: Understanding of story characters’ thoughts and feelings by able autistic, mentally handicapped, and normal children and adults. Journal of autism and Developmental disorders, 24(2):129–154.
Peter Harris. 1996. Sufficient grounds for optimism?: The relationship between perceived controllability and optimistic bias. Journal of Social and Clinical Psychology, 15(1):9–52.
Martie G Haselton, Daniel Nettle, and Paul W Andrews. 2015. The evolution of cognitive bias. The handbook of evolutionary psychology, pages 724–746.
Scott A Hawkins and Reid Hastie. 1990. Hindsight: Biased judgments of past events after the outcomes are known. Psychological bulletin, 107(3):311.
Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Jiaqing Liang, and Yanghua Xiao. 2024. Can large language models understand real-world complex instructions? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18188–18196.
Xingwei He, Zhenghao Lin, Yeyun Gong, Alex Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, Weizhu Chen, et al. 2023. Annollm: Making large language models to be better crowdsourced annotators. arXiv preprint arXiv:2303.16854.
James Hedlund. 2000. Risky business: safety regulations, risk compensation, and individual behavior. Injury prevention, 6(2):82–89.
F. Heider. 1982. The Psychology of Interpersonal Relations. Lawrence Erlbaum Associates.
Steven J Heine and Darrin R Lehman. 1995. Cultural variation in unrealistic optimism: Does the west feel more vulnerable than the east? Journal of personality and social psychology, 68(4):595.
Pamela W Henderson and Robert A Peterson. 1992. Mental accounting and categorization. Organizational Behavior and Human Decision Processes, 51(1):92–117.
Richard J Herrnstein. 1961. Relative and absolute strength of response as a function of frequency of reinforcement. Journal of the experimental analysis of behavior, 4(3):267.
Nic Hooper, Ates Erdogan, Georgia Keen, Katharine Lawton, and Louise McHugh. 2015. Perspective taking reduces the fundamental attribution error. Journal of Contextual Behavioral Science, 4(2):69–72.
John Manoogian III and Buster Benson. 2016. The cognitive bias codex. Wikimedia Commons. Wikipedia’s complete (as of 2016) list of cognitive biases, arranged and designed by John Manoogian III (jm3). Categories and descriptions originally by Buster Benson.
Tiffany A Ito, Jeff T Larsen, N Kyle Smith, and John T Cacioppo. 1998. Negative information weighs more heavily on the brain: the negativity bias in evaluative categorizations. Journal of personality and social psychology, 75(4):887.
Itay Itzhak, Gabriel Stanovsky, Nir Rosenfeld, and Yonatan Belinkov. 2024. Instructed to bias: Instruction-tuned language models exhibit emergent cognitive bias. Transactions of the Association for Computational Linguistics, 12:771–785.
Chuhao Jin, Kening Ren, Lingzhen Kong, Xiting Wang, Ruihua Song, and Huan Chen. 2024. Persuading across diverse domains: a dataset and persuasion large language model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1678–1706.
Jia Jin, Wuke Zhang, and Mingliang Chen. 2017. How consumers are affected by product descriptions in online shopping: Event-related potentials evidence of the attribute framing effect. Neuroscience Research, 125:21–28.
Edward E Jones and Victor A Harris. 1967. The attribution of attitudes. Journal of Experimental Social Psychology, 3(1):1–24.
Daniel Kahneman and Amos Tversky. 1979. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291.
Daniel Kahneman and Amos Tversky. 1982. Intuitive prediction: Biases and corrective procedures, page 414–421. Cambridge University Press.
Daniel Kahneman and Amos Tversky. 2013. Prospect theory: An analysis of decision under risk. In Handbook of the fundamentals of financial decision making: Part I, pages 99–127. World Scientific.
Mahammed Kamruzzaman, Md Minul Islam Shovon, and Gene Louis Kim. 2023. Investigating subtler biases in llms: Ageism, beauty, institutional, and nationality bias in generative models. arXiv preprint arXiv:2309.08902.
David E Kanouse and L Reid Hanson Jr. 1972. Negativity in evaluations. In E.E. Jones, D. E. Kanouse, S. Valins, H. H. Kelley, R. E. Nisbett, & B. Weiner (Eds.), Attribution: Perceiving the causes of behavior, pages 47–62. Morristown, NJ: General Learning Press.
Heather Kappes, Ann Harvey, Terry Lohrenz, Pendleton Montague, and Tali Sharot. 2020. Confirmation bias in the utilization of others’ opinion strength. Nature Neuroscience, 23:1–8.
Ralph Katz and Thomas J Allen. 1982. Investigating the not invented here (nih) syndrome: A look at the performance, tenure, and communication patterns of 50 r & d project groups. R&d Management, 12(1):7–20.
Ran Kivetz. 1999. Advances in research on mental accounting and reason-based choice. Marketing Letters, 10:249–266.
Joshua Klayman. 1995. Varieties of confirmation bias. Psychology of learning and motivation, 32:385–418.
Doron Kliger and Andrey Kudryavtsev. 2010. The availability heuristic and investors’ reaction to company-specific events. The journal of behavioral finance, 11(1):50–65.
Jack L Knetsch. 1989. The endowment effect and evidence of nonreversible indifference curves. The American Economic Review, 79(5):1277–1284.
Sheldon J Lachman and Alan R Bass. 1985. A direct study of halo effect. The journal of psychology, 119(6):535–540.
David Laibson. 1997. Golden eggs and hyperbolic discounting. The Quarterly Journal of Economics, 112(2):443–478.
Ellen J Langer. 1975. The illusion of control. Journal of personality and social psychology, 32(2):311.
Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W White, and Sujay Kumar Jauhar. 2023. Making large language models better data creators. arXiv preprint arXiv:2310.20111.
H. Leibenstein. 1950. Bandwagon, snob, and veblen effects in the theory of consumers’ demand. The Quarterly Journal of Economics, 64(2):183–207.
Lance Leuthesser, Chiranjeev Kohli, and Katrin Harich. 1995. Brand equity: The halo effect measure. European Journal of Marketing, 29:57–66.
Irwin P. Levin, Sandra L. Schneider, and Gary J. Gaeth. 1998. All frames are not created equal: A typology and critical analysis of framing effects. Organizational Behavior and Human Decision Processes, 76(2):149–188.
Xun Liang, Hanyu Wang, Yezhaohui Wang, Shichao Song, Jiawei Yang, Simin Niu, Jie Hu, Dan Liu, Shunyu Yao, Feiyu Xiong, et al. 2024. Controllable text generation for large language models: A survey. arXiv preprint arXiv:2408.12599.
Falk Lieder, Tom Griffiths, and Noah Goodman. 2012. Burn-in, bias, and the rationality of anchoring. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc.
Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. 2024. On llms-driven synthetic data generation, curation, and evaluation: A survey. Preprint, arXiv:2406.15126.
Ashley Luckman, Hossam Zeitoun, Andrea Isoni, Graham Loomes, Ivo Vlaev, Nattavudh Powdthavee, and Daniel Read. 2021. Risk compensation during covid-19: The impact of face mask usage on social distancing. Journal of Experimental Psychology: Applied, 27(4):722.
Dan P. Ly, Paul G. Shekelle, and Zirui Song. 2023. Evidence for Anchoring Bias During Physician Decision-Making. JAMA Internal Medicine, 183(8):818–823.
Colin MacLeod and Lynlee Campbell. 1992. Memory accessibility and probability judgments: an experimental evaluation of the availability heuristic. Journal of personality and social psychology, 63(6):890.
Olivia Macmillan-Scott and Mirco Musolesi. 2024. (ir)rationality and cognitive biases in large language models. Royal Society Open Science, 11(6):240255.
Keith M Marzilli Ericson and Andreas Fuster. 2014. The endowment effect. Annu. Rev. Econ., 6(1):555–579.
Yusufcan Masatlioglu and Efe A Ok. 2005. Rational choice with status quo bias. Journal of economic theory, 121(1):1–29.
Anne M McCarthy, F David Schoorman, and Arnold C Cooper. 1993. Reinvestment decisions by entrepreneurs: Rational decision-making or escalation of commitment? Journal of business venturing, 8(1):9–24.
Susan Miles and Victoria Scaife. 2003. Optimistic bias and food. Nutrition research reviews, 16(1):3–19.
Dale T Miller and Michael Ross. 1975. Self-serving biases in the attribution of causality: Fact or fiction? Psychological bulletin, 82(2):213.
Daniel Mochon and Shane Frederick. 2013. Anchoring in sequential judgments. Organizational Behavior and Human Decision Processes, 122(1):69–79.
Carey K Morewedge and Colleen E Giblin. 2015. Explanations of the endowment effect: an integrative review. Trends in cognitive sciences, 19(6):339–348.
MSCI and S&P Global. 2023. Global industry classification standard (gics). A classification standard jointly developed by MSCI and S&P Global for categorizing companies into sectors and industries. Published March 17, 2023, retrieved October 1, 2024.
Joel Myerson and Sandra Hale. 1984. Practical implications of the matching law. Journal of Applied Behavior Analysis, 17(3):367–380.
Richard Nadeau, Edouard Cloutier, and J.-H. Guay. 1993. New evidence about the existence of a bandwagon effect in the opinion formation process. International Political Science Review / Revue internationale de science politique, 14(2):203–213.
Rosanna Nagtegaal, Lars Tummers, Mirko Noordegraaf, and Victor Bekkers. 2020. Designing to debias: Measuring and reducing public managers’ anchoring bias. Public Administration Review, 80(4):565–576.
Raymond Nickerson. 1998. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2:175–220.
Richard E Nisbett and Timothy D Wilson. 1977. The halo effect: Evidence for unconscious alteration of judgments. Journal of personality and social psychology, 35(4):250.
Mahsan Nourani, Chiradeep Roy, Jeremy E Block, Donald R Honeycutt, Tahrima Rahman, Eric Ragan, and Vibhav Gogate. 2021. Anchoring bias affects mental model formation and user reliance in explainable ai systems. In Proceedings of the 26th International Conference on Intelligent User Interfaces, IUI ’21, page 340–350, New York, NY, USA. Association for Computing Machinery.
Andreas Opedal, Alessandro Stolfo, Haruki Shirakami, Ying Jiao, Ryan Cotterell, Bernhard Schölkopf, Abulhair Saparov, and Mrinmaya Sachan. 2024. Do language models exhibit the same cognitive biases in problem solving as human learners? arXiv preprint arXiv:2401.18070.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
Gary Pan, Shan L Pan, Michael Newman, and Donal Flynn. 2006. Escalation and de-escalation of commitment: a commitment transformation analysis of an e-government project. Information Systems Journal, 16(1):3–21.
Mihir Parmar, Swaroop Mishra, Mor Geva, and Chitta Baral. 2023. Don’t blame the annotator: Bias already starts in the annotation instructions. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1779–1789, Dubrovnik, Croatia. Association for Computational Linguistics.
Sam Peltzman. 1975. The effects of automobile safety regulation. Journal of political Economy, 83(4):677–725.
Stephen Pilli. 2023. Exploring conversational agents as an effective tool for measuring cognitive biases in decision-making. In 2023 10th International Conference on Behavioural and Social Computing (BESC), pages 1–5. IEEE.
Sivan Portal, Russell Abratt, and Michael Bendixen. 2018. Building a human brand: Brand anthropomorphism unravelled. Business Horizons, 61(3):367–374.
Diane Proudfoot. 2011. Anthropomorphism and ai: Turing’s much misunderstood imitation game. Artificial Intelligence, 175(5-6):950–957.
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
Heidi R. Riggio and Amber L. Garcia. 2009. The power of situations: Jonestown and the fundamental attribution error. Teaching of Psychology, 36(2):108–112.
Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
Neal J. Roese and Kathleen D. Vohs. 2012. Hindsight bias. Perspectives on Psychological Science, 7(5):411–426. PMID: 26168501.
J.H. Rohlfs. 2003. Bandwagon Effects in High-technology Industries. MIT Press.
Benjamin D Rosenberg and Jason T Siegel. 2018. A 50-year review of psychological reactance theory: Do not read this article. Motivation Science, 4(4):281.
Lee Ross. 1977. The intuitive psychologist and his shortcomings: Distortions in the attribution process. In Advances in experimental social psychology, volume 10, pages 173–220. Elsevier.
David Rozado. 2024. The political preferences of llms. arXiv preprint arXiv:2402.01789.
Paul Rozin and Edward B Royzman. 2001. Negativity bias, negativity dominance, and contagion. Personality and social psychology review, 5(4):296–320.
Ariel Rubinstein. 2003. “economics and psychology”? the case of hyperbolic discounting. International Economic Review, 44(4):1207–1216.
Arleen Salles, Kathinka Evers, and Michele Farisco. 2020. Anthropomorphism in ai. AJOB neuroscience, 11(2):88–95.
Ömür Saltık, Wasim Rehman, Rıdvan Söyü, Suleyman Degirmen, and Ahmet Sengonul. 2023. Predicting loss aversion behavior with machine-learning methods. Humanities and Social Sciences Communications, 10.
William Samuelson and Richard Zeckhauser. 1988. Status quo bias in decision making. Journal of risk and uncertainty, 1:7–59.
Abulhair Saparov and He He. 2022. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240.
Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, and Rama Chellappa. 2024. Addressing cognitive bias in medical language models. arXiv preprint arXiv:2402.08113.
Jeffrey B Schmidt and Roger J Calantone. 2002. Escalation of commitment during new product development. Journal of the academy of marketing science, 30:103–118.
Rüdiger Schmitt-Beck. 2015. Bandwagon Effect. John Wiley & Sons, Ltd.
Hamish GW Seaward and Simon Kemp. 2000. Optimism bias and student debt. New Zealand journal of psychology, 29(1):17–19.
Sakib Shahriar, Brady D Lund, Nishith Reddy Mannuru, Muhammad Arbab Arshad, Kadhim Hayawi, Ravi Varma Kumar Bevara, Aashrith Mannuru, and Laiba Batool. 2024. Putting gpt-4o to the sword:
A comprehensive evaluation of language, vision, speech, and multimodal proficiency. Applied Sciences, 14(17):7782.
Jonathan Shalev. 2000. Loss aversion equilibrium. International Journal of Game Theory, 29:269–287.
Hersh Shefrin and Meir Statman. 1985. The disposition to sell winners too early and ride losers too long: Theory and evidence. The Journal of finance, 40(3):777–790.
James Shepperd, Wendi Malone, and Kate Sweeny. 2008. Exploring causes of the self-serving bias. Social and Personality Psychology Compass, 2(2):895–908.
Emmanuel Marques Silva, Rafael de Lacerda Moreira, and Patricia Maria Bortolon. 2023. Mental accounting and decision making: a systematic literature review. Journal of Behavioral and Experimental Economics, 107:102092.
Herbert A Simon. 1990. Bounded rationality. Utility and probability, pages 15–18.
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature, 620(7972):172–180.
Dustin Sleesman, Anna Lennard, Gerry Mcnamara, and Donald Conlon. 2018. Putting escalation of commitment in context: A multi-level review and analysis. Academy of Management Annals, 12:annals.2016.0046.
Mark Snyder and William Swann. 1978. Hypothesis testing in social judgment. Journal of Personality and Social Psychology, 36:1202–1212.
Melvin Snyder and Arthur Frankel. 1976. Observer bias: A stringent test of behavior engulfing the field. Journal of Personality and Social Psychology, 34:857–864.
Melvin L Snyder, Walter G Stephan, and David Rosenfield. 1976. Egotism and attribution. Journal of Personality and Social Psychology, 33(4):435.
Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. 2024. The good, the bad, and the greedy: Evaluation of llms should not ignore non-determinism. arXiv preprint arXiv:2407.10457.
Barry M. Staw. 1976. Knee-deep in the big muddy: a study of escalating commitment to a chosen course of action. Organizational Behavior and Human Performance, 16(1):27–44.
Barry M. Staw. 1981. The escalation of commitment to a course of action. The Academy of Management Review, 6(4):577–587.
Christina Steindl, Eva Jonas, Sandra Sittenthaler, Eva Traut-Mattausch, and Jeff Greenberg. 2015. Understanding psychological reactance. Zeitschrift für Psychologie.
Joachim Stöber. 2001. The social desirability scale-17 (sds-17): Convergent validity, discriminant validity, and relationship with age. European Journal of Psychological Assessment, 17(3):222.
Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. 2024. Large language models are inconsistent and biased evaluators. arXiv preprint arXiv:2405.01724.
Alaina N Talboy and Elizabeth Fuller. 2023. Challenging the appearance of machine intelligence: Cognitive bias in llms and best practices for adoption. arXiv preprint arXiv:2304.01358.
Zhen Tan, Alimohammad Beigi, Song Wang, Ruocheng Guo, Amrita Bhattacharjee, Bohan Jiang, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. 2024. Large language models for data annotation: A survey. arXiv preprint arXiv:2402.13446.
Richard Thaler. 1980. Toward a positive theory of consumer choice. Journal of economic behavior & organization, 1(1):39–60.
Richard Thaler. 1985. Mental accounting and consumer choice. Marketing Science, 4(3):199–214.
Richard H Thaler. 1999. Mental accounting matters. Journal of Behavioral decision making, 12(3):183–206.
Suzanne C Thompson. 1999. Illusions of control: How we overestimate our personal influence. Current Directions in Psychological Science, 8(6):187–190.
Edward Lee Thorndike. 1920. A constant error in psychological ratings. Journal of Applied Psychology, 4:25–29.
Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkwar, and Graham Neubig. 2024. Do llms exhibit human-like response biases? a case study in survey design. Transactions of the Association for Computational Linguistics, 12:1011–1026.
Sabrina M. Tom, Craig R. Fox, Christopher Trepel, and Russell A. Poldrack. 2007. The neural basis of loss aversion in decision-making under risk. Science, 315(5811):515–518.
Xiaoyu Tong, Rochelle Choenni, Martha Lewis, and Ekaterina Shutova. 2024. Metaphor understanding challenge dataset for LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3517–3536, Bangkok, Thailand. Association for Computational Linguistics.
Amos Tversky and Daniel Kahneman. 1973. Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5(2):207–232.
Amos Tversky and Daniel Kahneman. 1974. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131.
Amos Tversky and Daniel Kahneman. 1981. The framing of decisions and the psychology of choice. Science, 211(4481):453–458.
Amos Tversky and Daniel Kahneman. 1983. Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological review, 90(4):293.
Amos Tversky and Daniel Kahneman. 1991. Loss Aversion in Riskless Choice: A Reference-Dependent Model*. The Quarterly Journal of Economics, 106(4):1039–1061.
Amrisha Vaish, Tobias Grossmann, and Amanda Woodward. 2008. Not all emotions are created equal: the negativity bias in social-emotional development. Psychological bulletin, 134(3):383.
Max van Duijn, Bram van Dijk, Tom Kouwenhoven, Werner de Valk, Marco Spruit, and Peter van der Putten. 2023. Theory of mind in large language models: Examining performance of 11 state-of-the-art models vs. children aged 7-10 on advanced tests. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 389–402, Singapore. Association for Computational Linguistics.
Lyn M Van Swol. 2007. Perceived importance of information: The effects of mentioning information, shared information bias, ownership bias, reiteration, and confirmation bias. Group processes & intergroup relations, 10(2):239–256.
A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.
Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, and Tomas Pfister. 2024b. Codeclm: Aligning language models with tailored synthetic data. arXiv preprint arXiv:2404.05875.
P. C. Wason. 1960. On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12(3):129–140.
Peter C. Wason. 1966. Reasoning. In Peter C. Wason, editor, New Horizons in Psychology, pages 135–151. Penguin Books.
Martin Weber and Colin F Camerer. 1998. The disposition effect in securities trading: An experimental analysis. Journal of Economic Behavior & Organization, 33(2):167–184.
Neil D. Weinstein. 1989. Optimistic biases about personal risks. Science, 246(4935):1232–1233.
Christopher G Wetzel, Timothy D Wilson, and James Kort. 1981. The halo effect revisited: Forewarned is not forearmed. Journal of Experimental Social Psychology, 17(4):427–439.
Gerald JS Wilde. 1982. The theory of risk homeostasis: implications for safety and health. Risk analysis, 2(4):209–225.
Anna Winterbottom, Hilary L Bekker, Mark Conner, and Andrew Mooney. 2008. Does narrative information bias individual’s decision making? a systematic review. Social science & medicine, 67(12):2079–2088.
Arab Emirates. Association for Computational Linguistics.

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al. 2024. Justice or prejudice? quantifying biases in llm-as-a-judge. arXiv preprint arXiv:2410.02736.

Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. 2024. Yi: Open foundation models by 01.ai. arXiv preprint arXiv:2403.04652.

Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander J Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. 2024. Large language model as attributed training data generator: A tale of diversity and bias. Advances in Neural Information Processing Systems, 36.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2023. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations.

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

A Framework: Application Examples

We demonstrate two examples of the framework’s universality feature. Table 4 features an adaptation of the Bandwagon Effect testing procedure to the medical domain. Table 3 provides an example of a common testing procedure from the theory of mind research.

B Cognitive Biases

is better than B, followed by a conclusion that A is clearly better than B, representing the model’s prior belief. We then show three pieces of new evidence suggesting that B is better than A. After seeing that new evidence, the model is asked for its revised preference for either A or B on a 7-step Likert scale σ1 with the midpoint representing indifference.

To account for any objective differences in the strengths of the evidence for A and B, we reverse the order of A and B between control and treatment. Only if the model consistently prefers the alternative that was introduced first, conservatism is present. We measure the strength of the bias as the consistent preference of the first alternative over the second one.

B.2 Anchoring

Anchoring, also known as anchoring bias or anchoring effect, is a phenomenon of making “estimates, which are biased toward the initially presented values” (Tversky and Kahneman, 1974), potentially irrelevant ones. This effect has been elicited in several settings (Furnham and Boo, 2011). Anchoring is investigated across different domains, including finance (Campbell and Sharpe, 2009), management (Nagtegaal et al., 2020), healthcare (Ly et al., 2023), and artificial intelligence (Lieder et al., 2012; Nourani et al., 2021).

We approach the testing by directly following the comparative judgment paradigm (Mochon and Frederick, 2013). In control and treatment, the LLM is prompted to estimate a variable. Additionally, the treatment variant contains an instruction to first evaluate the variable relative to the provided numerical value. This value serves as the anchor in the test design.

The anchoring effect is thus identified for deviations between the estimations in the anchor-free and anchored formulation. The answers are obtained on an 11-point percentage scale σ2.
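The anchoring comparison between the anchor-free (control) and anchored (treatment) estimates can be illustrated with a minimal sketch. It is not the metric used in this work: the function name `anchoring_shift`, the encoding of the 11-point percentage scale σ2 as 0%, 10%, ..., 100%, and the normalization by 100 are all assumptions made for illustration.

```python
# Assumed encoding of the 11-point percentage scale: 0%, 10%, ..., 100%.
SIGMA_2 = [i * 10 for i in range(11)]


def anchoring_shift(control: int, treatment: int, anchor: int) -> float:
    """Hypothetical score in [-1, 1] for one control/treatment pair.

    Positive values mean the anchored (treatment) estimate lies closer to
    the anchor than the anchor-free (control) estimate; negative values
    mean it moved away from the anchor; zero means no change in distance.
    """
    for answer in (control, treatment):
        if answer not in SIGMA_2:
            raise ValueError(f"answer {answer} is not on the assumed scale")
    if not 0 <= anchor <= 100:
        raise ValueError("anchor must be a percentage between 0 and 100")
    distance_before = abs(control - anchor)
    distance_after = abs(treatment - anchor)
    return (distance_before - distance_after) / 100
```

Under these assumptions, a control answer of 50% that moves to 80% in the presence of a 90% anchor yields a positive score, while identical answers in both variants yield zero.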
the model is told what the group is (e.g., Muslims), whereas in control, it is not.

The model can choose from four options describing the characteristics of the group, where two options represent characteristics stereotypical of that group and two options represent characteristics atypical of that group. For each pair of options, one is typical of people overall, while the other is atypical of people overall. If the model switches from choosing an atypical characteristic to a stereotypical one once the particular group is known, we conclude that the model exhibits stereotyping. In the inverse case, it would exhibit negative stereotyping. We obtain answers on a 7-point Likert scale σ1.

B.4 Social Desirability Bias

Social desirability bias is “the tendency to present oneself and one’s social context in a way that is perceived to be socially acceptable” (Bergen and Labonté, 2020). It is often studied in the context of surveys where it refers to the tendency to answer survey questions in a way that will be viewed favorably by others (Krumpal, 2013). Edwards (1953) introduced the notion of social desirability describing the “relationship between the judged desirability of a trait and the probability that the trait will be endorsed”. The bias has been studied extensively in survey respondents self-reporting their personality traits showing a “tendency of subjects to attribute to themselves statements which are desirable and reject those which are undesirable” (Edwards, 1957).

Common testing procedures rely on scales such as the Social Desirability Scale (SDS) (Edwards, 1957), the Marlowe-Crowne Social Desirability Scale (M-C SDS) (Crowne and Marlowe, 1960), or the Social Desirability Scale-17 (SDS-17) (Stöber, 2001), which include a number of statements about personality traits which are either clearly socially desirable or undesirable, e.g., “I’m always willing to admit it when I make a mistake” (Crowne and Marlowe, 1960). These scales can be used to test how many times a subject responds with a socially desirable answer.

Our test procedure is inspired by Albert and Tullis (2013), who report that people tend to follow socially desirable norms more strictly in public settings as opposed to anonymous settings. We ask the LLM to express whether a statement is true or false as it pertains to the LLM. In control, we note that the LLM’s answer will be treated confidentially and not be shared with anyone. In treatment, we note that the LLM’s answer will be made public and can be linked back to the LLM. We sample statements from the M-C SDS (Crowne and Marlowe, 1960). From the scale, we remove 17 statements describing emotions, thoughts, or real-world interactions which are not applicable to LLMs, leaving 16 statements testable with LLMs. We obtain answers on a 7-point Likert scale σ1. The metric takes a value of 1 only if the model self-reports undesirable behavior in control, the anonymous setting, but then chooses desirable behavior in treatment, the public setting, and −1 in the reverse case.

B.5 Loss Aversion

Proposed by Kahneman and Tversky (1979), loss aversion is present when the “disutility of giving up an object is greater than the utility associated with acquiring it” (Kahneman et al., 1991), i.e., when losses are perceived to be psychologically more powerful than gains. Well-established, this bias has been investigated in both risky and riskless (Tversky and Kahneman, 1991) contexts from various perspectives, including neuroscience (Tom et al., 2007), game theory (Shalev, 2000), and machine learning (Saltık et al., 2023).

We base our testing on the variation of the standard Samuelson’s colleague problem formulated in Ert and Erev (2013). The model is presented with a choice of two options with the material outcomes $f_{1,2}$ designed as follows ($a > 0$ denotes the commodity amount, $p$ denotes probability):

$$f_1 = a, \quad a > 0, \quad \text{i.e., guaranteed gain} \tag{3}$$

$$f_2 = \begin{cases} \lambda a,\ \lambda > 2 & \text{with } p = \tfrac{1}{2} \\ -a & \text{with } p = \tfrac{1}{2} \end{cases} \tag{4}$$

The second option, while being risky due to a potential loss, yields a more profitable outcome in expectation. In control and treatment, we switch the positions of the two options to account for the response bias. Loss aversion is thus present when the LLM consistently opts for the deterministic option, and we utilize a 7-point Likert scale σ1 to obtain answers.

B.6 Halo Effect

The halo effect is originally defined in Thorndike (1920) and is commonly known as “the influence
of a global evaluation on evaluations of individual attributes” (Nisbett and Wilson, 1977), even when there is sufficient evidence for their independence. Cooper (1981) generalizes the definition to the presence of correlation between two independent attributes. Notably persistent (Wetzel et al., 1981), this bias is well-studied in the fields of consumer science (Leuthesser et al., 1995), public relations (Coombs and Holladay, 2006), and education (Abikoff et al., 1993).

We build on the testing procedure of Lachman and Bass (1985). In both control and treatment, an asset is presented to the LLM, and the model is prompted to evaluate a concrete attribute of this asset. In treatment, the halo is additionally introduced: a separate independent attribute of this asset is described either positively or negatively.

The halo effect is present in cases of the estimation shift in treatment compared to control, either a positive one provided with a positive halo or a negative one given a negative halo. The symmetrical behavior results in the opposite effect. We obtain answers to the halo effect test on a 7-point Likert scale σ1.

B.7 Reactance

Reactance refers to “an unpleasant motivational arousal that emerges when people experience a threat to or loss of their free behaviors” (Steindl et al., 2015). Rosenberg and Siegel (2018) present an extensive review of reactance theory. Reactance theory was first proposed by Brehm (1966), who found that individuals tend to be motivated to regain their behavioral freedoms when these freedoms are reduced or threatened (Brehm, 1966; Brehm and Brehm, 2013). The level of reactance is influenced by the importance of the threatened freedom and the strength of the threat as perceived by the individual (Steindl et al., 2015).

Our test design is based on the procedure proposed by Dillard and Shen (2005), who measure reactance in the different responses of subjects to either a low-threat or a high-threat scenario. We describe a behavior where the test taker previously had the freedom to choose if and how often to engage in this behavior. This is followed by a number of facts describing the negative consequences of this behavior. In control, these facts are presented as part of a low-threat framing and in treatment as part of a high-threat framing.

Specifically, our low-threat scenario recommends that the subject changes his/her behavior (e.g., “consider doing it responsibly”) while the high-threat scenario demands a change of behavior (e.g., “you have to stop it”).

To measure the effect, we present the model with options describing different levels of engagement with the behavior. An increased engagement with the behavior from the low-threat to the high-threat variant indicates the presence of reactance (i.e., an adverse response to the threat). We obtain the answers to the effect on an 11-point percentage scale σ2.

B.8 Confirmation Bias

Originally described by Wason (1960), confirmation bias commonly refers to the “inclination to discount information that contradicts past judgments” (Kappes et al., 2020). Confirmation bias is known to arise during the search and the interpretation of information, as well as their combination (Klayman, 1995; Nickerson, 1998). Approaches to testing this bias include variations of the classical Wason selection task (Wason, 1966), two-phase evidence-seeking paradigms (Cook and Smallman, 2008; Fischer et al., 2011), and weighting of provided evidence (Snyder and Swann, 1978; Beebe and Pherson, 2011).

We directly employ the latter technique for the testing. In the control and treatment procedures, the model is associated with a proposal and is presented with a set of arguments against it. In control, the model is said to have not yet decided on its proposal. On the contrary, in treatment, the LLM is prompted to have already made the decision, i.e., this decision is considered the model’s past judgment. In both variants, the LLM is prompted to select the number of presented arguments that are relevant while and after making the decision in control and treatment, respectively.

The answers of the LLM to the confirmation bias test are obtained on an 11-point scale σ2. The metric reflects the extent to which this selection is imbalanced between the cases of absence and presence of the past judgment.

B.9 Not Invented Here

The not-invented-here syndrome (NIH) is commonly described as an attitudinal bias against the knowledge that an individual perceives as external (Katz and Allen, 1982; Kostova and Roth, 2002). The framework by Antons and Piller (2015) depicts two key elements of this bias: first, the source of knowledge, distinguishing organizational, contextual (disciplinary), and spatial (geographical) externality. Second, the underestimation of the value of this knowledge or the overestimation of the costs of its obtainment. There may be different underlying mechanisms causing this syndrome, including ego-defensive (e.g., Baer and Brown, 2012) or utilitarian functions (e.g., Argote et al., 2003).

Our test follows the concept of value estimation by introducing a decision scenario and asking for the evaluation of a respective proposal. In control, the test case informs that one proposal is suggested by a colleague in the decision-maker’s own team. In treatment, the statement is changed to indicate the external source of the proposal, whereby we sample the type of externality to be either organizational, contextual, or spatial. For spatial externality we additionally sample the country of the colleague. Hereby, we include the three most populated countries per continent (only two for North America and Oceania).

A lower evaluation of the proposal, when it is described as from an external source, indicates the presence of the not-invented-here syndrome. The answers are obtained on a 7-point Likert scale σ1.

B.10 Illusion of Control

An illusion of control is “an expectancy of a personal success probability inappropriately higher than the objective probability would warrant” (Langer, 1975). In other words, people tend to overestimate their ability to control events (Thompson, 1999). Langer (1975), who named the illusion of control, reports that factors typical of skill situations, such as competition, choice, familiarity, and involvement, can cause individuals to feel inappropriately confident.

Our test design builds onto the findings by Langer (1975). We describe an activity that typically has some success probability x. We then ask the model to judge its own success probability assuming that it would conduct the activity. We also add factors from skill situations to the description.

Specifically, we describe a situation where the model has recently been hired by an organization to supervise a business activity which typically has a success probability of x = 50%. To enrich the situation with bias-inducing factors, we randomly add either a description of (A) how the model is competing against others, (B) how it has full freedom of choice regarding how to run the activity, (C) how it is highly familiar with the activity, (D) how it will be deeply involved in the execution, or (E) no description of an additional factor.

We measure the illusion of control as any success probability judged by the model that exceeds the objective success probability x. The answers are obtained on an 11-point percentage scale σ2.

B.11 Survivorship Bias

Survivorship bias is a form of selection bias that can occur when we only focus on data from subjects who “proceeded past a selection or elimination process” (a.k.a. “survivors”) “while overlooking those who did not” (Elston, 2021). Hence, survivorship bias can cause us to draw conclusions about the general population of subjects that are biased toward the survivors. The bias was first described by statistician Wald (1943) who studied World War II aircraft and the damage they incurred during battle. Since then, survivorship is often observed in financial and investment contexts (Brown et al., 1992; Ball and Watts, 1979).

To test the presence of the bias in LLMs, we describe a decision-making task that involves choosing somehow good entities from a pool that contains both good and bad entities. We then introduce a characteristic of these entities that could be used to separate good from bad entities and define what percentages x_good and x_bad of the entities have this characteristic among the good and the bad entities, respectively. x_good and x_bad are sampled from the same narrow interval and are very close together. In control, we report both x_good and x_bad to the model, whereas in treatment, we only report x_good, reflecting a situation where we only focus on the survivors. Lastly, we ask the model how important it thinks the characteristic is to distinguish good from bad entities.

Specifically, we sample both x_good and x_bad from a relatively small interval [0.90, 0.95] to simulate a situation where the difference is likely not statistically significant between the two groups and both, x_good and x_bad, are large.

We measure the strength of survivorship bias as the excess importance of the characteristic in treatment over control as judged by the model. The answers are obtained on a 7-point Likert scale σ1.

B.12 Escalation of Commitment

First examined in Staw (1976), escalation of commitment, also known as commitment bias, refers to “the act of ‘carrying on’ with questionable or failing courses of action” (Sleesman et al., 2018). Due to its nature, the bias has been extensively
studied, among others, in finance (McCarthy et al., 1993), governance (Pan et al., 2006), and research & development (Schmidt and Calantone, 2002).

Our procedure is based on the findings of Staw (1981), which emphasizes the connection between escalation of commitment and responsibility. In this paradigm, the model is presented with a decision that has been made in the past and evidence suggesting that this decision should have been made differently. We then ask the model for its intention to change the decision. In the control variant, the past decision is attributed to the LLM, and in the treatment variant, to another independent actor.

Greater commitment to decisions made by the subject indicates the presence of the bias. The answers to the effect’s testing are measured on an 11-point percentage scale σ2.

B.13 Information Bias

Information bias denotes the heuristic to request new information even when none of the potential findings could change the basis for action, which was demonstrated for the medical domain by Baron et al. (1988). In their experiments, subjects chose to run medical tests that could not change the prior treatment decision for the hypothetical patients. The term information bias is, however, also employed as a catch-all phrase for a group of information-related biases (e.g., confirmation bias), and further specifications exist, such as narrative information bias (Winterbottom et al., 2008) or shared information bias (Van Swol, 2007).

For our tests, we employ a simplified version of the experiment by Baron et al. (1988), with a description of a decision event and a currently considered course of action. In control, we ask the model about its confidence in advancing with this course. In treatment, we instead ask if the model needs any additional information to advance with this course. Answers indicating strong confidence in the control variant and a high need for additional information in the treatment variant suggest the presence of information bias.

We obtain answers to the information bias test on a 7-point Likert scale σ1.

B.14 Mental Accounting

Proposed by Thaler (1985), mental accounting is described as “a cognitive process whereby people treat resources differently depending on how they are labeled and grouped, which consequently leads to violations of the normative economic principle of fungibility” (Kivetz, 1999), i.e., the same resources in different mental accounts are not equivalent. An extensive review of various facets of this effect and its presence in different applications is assembled in Silva et al. (2023).

We frame our test in direct accordance with the “theater ticket” experiment in Tversky and Kahneman (1981), which is a standard technique to elicit mental accounting (Thaler, 1999; Henderson and Peterson, 1992). In both variants, an investment decision is described. In control, this investment is lost irrevocably, and the model is prompted to choose whether or not to make another such investment to compensate for the lost one. The treatment variant, in turn, features a separate, independent loss of the same amount. The LLM is then prompted to decide if the initial investment decision nonetheless holds or not.

A discrepancy in these two decisions indicates the presence of mental accounting, i.e., it shows that the equal losses described belong to different, non-equivalent mental accounts. The answers are obtained on a 7-point Likert scale σ1.

B.15 Optimism Bias

Optimism bias represents the “tendency to overestimate the likelihood of favorable future outcomes and underestimate the likelihood of unfavorable future outcomes” (Bracha and Brown, 2012). This effect is ubiquitous (Weinstein, 1989) and impacts diverse aspects of human activities: ethics in research (Chalmers and Matthews, 2006), finance (Seaward and Kemp, 2000), people’s health (Miles and Scaife, 2003) and safety (DeJoy, 1989). Fife-Schaw and Barnett (2004) identify two main approaches to measure the optimism bias: direct and indirect comparisons.

For our testing, we adopt the latter technique (Heine and Lehman, 1995; Harris, 1996). Either a positive or a negative situation is introduced. In control and treatment, the model is prompted to estimate the likelihood of facing such a situation for an abstract subject and the LLM itself, respectively.

As in the definition of the optimism bias, we consider positive and negative shifts in estimation for the corresponding types of circumstances to be indicators of the optimism bias. The answers to the test are given by the model on an 11-point percentage scale σ2.
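The sign convention of the optimism bias test (a higher self-estimate for positive situations, a lower one for negative situations) can be made concrete with a small sketch. This is an illustration under stated assumptions rather than the metric of this work: the function name `optimism_shift`, the labels "positive" and "negative", and the division by 100 are hypothetical, and answers are assumed to be percentages taken from the 11-point scale σ2.

```python
def optimism_shift(control: int, treatment: int, situation: str) -> float:
    """Hypothetical signed optimism score in [-1, 1].

    control:   likelihood (in percent) estimated for an abstract subject.
    treatment: likelihood (in percent) the model estimates for itself.
    situation: "positive" or "negative" (assumed labels).

    A positive score means the shift points in the optimistic direction:
    the model rates a positive event as more likely, or a negative event
    as less likely, for itself than for the abstract subject.
    """
    if situation not in ("positive", "negative"):
        raise ValueError("situation must be 'positive' or 'negative'")
    for answer in (control, treatment):
        if not 0 <= answer <= 100:
            raise ValueError("answers must be percentages in [0, 100]")
    delta = (treatment - control) / 100
    # For negative situations the optimistic direction is a decrease.
    return delta if situation == "positive" else -delta
```

With these assumptions, the same raw shift of +20 percentage points counts toward the bias for a positive situation and against it for a negative one.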
B.16 Status-Quo Bias

Status quo bias is known as a disproportionate preference for the current state of affairs, the status quo, over other alternatives that may be available (Samuelson and Zeckhauser, 1988). The status quo often serves as a reference point against which other alternatives are evaluated (Masatlioglu and Ok, 2005).

Our test design introduces a decision task with two options where one option is presented as the status quo and the other as an alternative. To account for any natural preference the model may have for one option over the other and isolate only the status quo bias, we switch the option that is marked the status quo between control and treatment.

We measure status quo bias when the model consistently prefers the option marked as the status quo in both, control and treatment, even though the options are switched. We obtain answers in the testing procedure on a 7-point Likert scale σ1.

B.18 Self-Serving Bias

The “tendency to attribute success to internal factors and attribute failure to external factors” is known as the self-serving bias (Bradley, 1978). Two motivations, namely self-enhancement and self-recognition, are proposed to explain such attribution (Shepperd et al., 2008). As a widespread bias (Blaine and Crocker, 1993), self-serving bias is targeted in a number of experiment approaches (Campbell and Sedikides, 1999).

Our testing stems from the achievement task paradigm in Miller and Ross (1975); Snyder et al. (1976). The test features a task, which is introduced as being failed or successfully completed by the model in the control and treatment variants, respectively. The LLM is then prompted to assess the extent to which its performance in this task is explained by internal factors.

The discrepancy between control and treatment estimates points to the presence of self-serving bias, and it is thus quantified on the basis of answers obtained on a 7-point Likert scale σ1.
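For the self-serving bias test, the discrepancy between the failure (control) and success (treatment) variants could be scored roughly as follows. The sketch rests on assumptions that are not stated in the text: the 7-point Likert scale σ1 is encoded as the integers 1 to 7, with 7 meaning that performance is fully explained by internal factors, and the function name `self_serving_score` and the normalization by the scale width are hypothetical.

```python
LIKERT_MIN, LIKERT_MAX = 1, 7  # assumed integer encoding of the 7-point scale


def self_serving_score(control: int, treatment: int) -> float:
    """Hypothetical self-serving score in [-1, 1].

    control:   internal attribution after a failed task (1..7).
    treatment: internal attribution after a successful task (1..7).

    Positive values mean success is attributed to internal factors more
    strongly than failure, which is the direction of the self-serving bias.
    """
    for answer in (control, treatment):
        if not LIKERT_MIN <= answer <= LIKERT_MAX:
            raise ValueError("answers must lie on the assumed 1..7 scale")
    return (treatment - control) / (LIKERT_MAX - LIKERT_MIN)
```

Equal answers in both variants give a score of zero, and the extreme pattern (1 after failure, 7 after success) gives the maximum score of 1 under this encoding.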
difference between treatment and control answers.

B.20 Risk Compensation

Risk compensation, also known as Peltzman effect, is the tendency to compensate additional safety imposed through regulation by riskier behavior (Hedlund, 2000). One hypothesis states that there exists a personal target level of risk (Wilde, 1982), while the effect has also been attributed to rational economic behavior (Peltzman, 1975). In their review, Hedlund (2000) concludes that risk compensation occurs in some contexts while it is absent in others, depending on four factors influencing risk compensating behavior: visibility of the safety measure, its perceived effect, motivation for behavior change, and personal control of the situation. Risk compensation has almost exclusively been discussed with respect to personal injury and health risks, most recently for the case of face masks during COVID-19 (Luckman et al., 2021).

In our test design, a decision-making scenario is described along with a risky option and the personal risk attached to this choice. In the control, the test case directly asks for the probability of going ahead with the risky choice. The treatment includes an additional statement about a new regulation by the organization reducing the risk.

The difference in probability of the risky behavior between control and treatment indicates the presence and strength of a risk compensation effect. The answers are obtained on an 11-point percentage scale σ2.

B.21 Bandwagon Effect

The bandwagon effect denotes the tendency to change and adopt opinions, habits, and behavior according to the majority (Leibenstein, 1950). This effect has been observed in various processes, including politics (Schmitt-Beck, 2015) and management (Rohlfs, 2003). Several paradigms have been proposed for eliciting the bandwagon effect (Bindra et al., 2022).

We adopt the method by Nadeau et al. (1993). In the test, the model is presented with a task and two opinions, each suggesting a distinct solution. In the control and treatment variants, both opinions are labeled alternatingly; a single arbitrary label is consistently attributed to the majority at both stages. In each case, the LLM is prompted to choose the preferred point of view.

A switch in the model’s selection indicates the absence of the bias, while consistent choices show either the presence of bandwagon effect (in case of alignment with the majority option) or its opposite variant, sometimes called snob effect (Leibenstein, 1950). The answers to the test are obtained on a 7-point Likert scale σ1.

B.22 Endowment Effect

Coined by Thaler (1980), the endowment effect refers to one’s inclination “to demand much more to give up an object than one would be willing to pay to acquire it” (Kahneman et al., 1991). Several cognitive origins for the effect have been proposed in Morewedge and Giblin (2015). Two predominant strategies to assess the endowment effect are the exchange paradigm (Knetsch, 1989) and the valuation paradigm (Marzilli Ericson and Fuster, 2014).

In our experiment, we follow the latter approach (Kahneman et al., 1990). In control, the LLM is prompted to evaluate the minimum amount it is willing to accept (WTA) to give up the asset it owns. Symmetrically, in the treatment variant, we estimate the model’s maximum willingness to pay (WTP) to acquire the same asset, which, in this case, it does not possess initially.

The normalized difference between WTA and WTP (options are provided on an 11-point percentage scale σ2) quantifies the endowment effect.

B.23 Framing Effect

“Shifts of preference when the same problem is framed in different ways” (Tversky and Kahneman, 1981) denote the presence of the framing effect. In the classification by Levin et al. (1998), three types of framing, namely goal, attribute, and risk, are identified to be susceptible to the effect. This cognitive bias has been studied in contexts including healthcare (Gallagher and Updegraff, 2011), politics (Druckman, 2001), and consumer science (Jin et al., 2017).

Our testing strategy follows directly from the attribute framing effect definition and replicates the study conducted in Fuenzalida et al. (2021). The model is prompted to perform an evaluation given a quantitative metric measured in percent. In control and treatment, this attribute is framed differently: we employ positive (value v of the initial metric) and negative (value 1 − v of the opposite metric) framings, respectively.

As descriptions are essentially identical in both variants, an inconsistency in the LLM’s evaluation serves as an indicator of the framing effect. The
answers are obtained on a 7-point Likert scale σ1 . and situational explanations identical in both vari-
The biasedness depends on the direction and mag- ants. A score based on the answers selected from
nitude of the deviation. Note that, by definition of the framing effect, a less favorable evaluation is expected to be obtained in the negative framing and a more favorable one in the positive framing.

B.24 Anthropomorphism
Anthropomorphism, or anthropomorphic bias, is the “tendency to imbue the real or imagined behavior of non-human agents with human-like characteristics” (Epley et al., 2007). Dacey (2017) argues for treating this effect as a cognitive bias and analyses several control measures for it. Besides other subjects (Chaminade et al., 2007; Portal et al., 2018), AI has actively prompted discussion in studies of anthropomorphism (Proudfoot, 2011; Salles et al., 2020).
We draw the inspiration for the testing from Waddell (2019), which connects the concepts of preference and credibility to anthropomorphism. Our variation of testing introduces a subjective piece of information. In control, it is attributed to a machine; in treatment, to a human author. The LLM is prompted to evaluate the credibility and accuracy of this information piece.
Anthropomorphism is more prominent when the model assigns greater credibility and accuracy to the piece when it is attributed to a human. The answers are obtained on a 7-point Likert scale σ1.

B.25 Fundamental Attribution Error
Also known as attribution bias, the fundamental attribution error (FAE) was first described in Heider (1982). It corresponds to the propensity “to underestimate the impact of situational factors and to overestimate the role of dispositional factors” (Ross, 1977). Experimental practices to measure the bias include the attitude attribution paradigm (Jones and Harris, 1967) and the silent interview paradigm (Snyder and Frankel, 1976), among others.
Our testing follows the methodology in Flick and Schweitzer (2021), Hooper et al. (2015), and Riggio and Garcia (2009), which elicits the FAE from the actor-observer perspective. Both control and treatment feature a description of a controversial action, and between variants, the role of the LLM varies: it is either the actor or the observer of the activity.
When prompted to select the best reasoning for the action, the model is provided with dispositional

a 7-point Likert scale σ1 reflects the FAE, which is measured as the difference between the types of answers given: when the LLM employs situational explanation while being the actor and adopts the dispositional one in the observer perspective, the bias is maximized.

B.26 Planning Fallacy
Proposed in Kahneman and Tversky (1982), planning fallacy is defined as the tendency “to underestimate the completion time, even when one has considerable experience of corresponding past failures”. Kahneman and Tversky (1982) introduced an inside versus outside cognitive model for the planning fallacy, which was extended in Buehler et al. (2010). The classical testing procedure compares predicted and actual task completion times in various settings (Buehler et al., 1994; Burt and Kemp, 1994).
Due to the infeasibility of leveraging the true completion times, we test whether the models “maintain their optimism about the current project in the face of historical evidence to the contrary” (Buehler et al., 2010). The procedure features the task of allocating time for a project. In the control version, the LLM is directly asked to estimate the required percentage of time, while the treatment prompt additionally contains the concrete percentage of overdue time, i.e., the negative historical evidence for the completion times of similar projects.
Insufficient update in the allocation of time across variants suggests the propensity of the model to maintain the estimates disregarding the negative evidence, which indicates the susceptibility to the planning fallacy. The answers are obtained on an 11-point percentage scale σ2.

B.27 Hyperbolic Discounting
An instantiation of the matching law (Myerson and Hale, 1984; Herrnstein, 1961), hyperbolic discounting “induces dynamically inconsistent preferences, implying a motive for consumers to constrain their own future choices” (Laibson, 1997). The two commonly proposed paradigms for eliciting hyperbolic discounting involve choosing between predefined configurations for the utility function (Ainslie and Haslam, 1992) and directly reconstructing the individual’s utility function (Benzion et al., 1989).
We approach the testing using the former technique (Rubinstein, 2003). In both variants, the
LLM is prompted to decide between options of receiving a reward at a corresponding time. Choices in the variants are represented in the following diagrams, where T ≫ τ > 0, α > 1:

Control
          choice 1    choice 2
Reward    r           α · r
Time      0           τ

Treatment
          choice 1    choice 2
Reward    r           α · r
Time      T           T + τ

groupings (Fu et al., 2012) and is closely connected to the notion of fairness (Chae et al., 2022).
We test the bias using a variation of the dictator game (Forsythe et al., 1994; Kahneman et al., 1986), which is a common approach for testing in-group bias (Everett et al., 2015; Abbink and Harris, 2019). In the test, a reward and two subjects are introduced. The LLM is prompted to decide which of the two subjects to assign the reward to. In control and treatment variants, the first and the second subjects share a group attribution with the model, respectively.
In-group bias is present for the LLM’s selections that coincide with the designated in-group members in both variants. The answers are obtained on a 7-point Likert scale σ1.
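To see why the control and treatment choices of the hyperbolic discounting test (B.27) can expose dynamically inconsistent preferences, the following minimal sketch compares the two choices under two discount functions. It is an illustration only, not the paper's measurement procedure: the one-parameter hyperbolic form r / (1 + k·t), the exponential baseline, and all parameter values (k, delta, r, alpha, tau, T) are assumptions chosen for the example.

```python
from typing import Callable


def hyperbolic_value(reward: float, delay: float, k: float = 1.0) -> float:
    """Present value under an assumed one-parameter hyperbolic discount."""
    return reward / (1.0 + k * delay)


def exponential_value(reward: float, delay: float, delta: float = 0.5) -> float:
    """Present value under an assumed exponential discount (baseline)."""
    return reward * delta ** delay


def prefers_later(value: Callable[[float, float], float],
                  r: float, alpha: float, t: float, tau: float) -> bool:
    """True if the larger, later reward (alpha * r at t + tau) beats r at t."""
    return value(alpha * r, t + tau) > value(r, t)


# Illustrative parameters satisfying T >> tau > 0 and alpha > 1.
r, alpha, tau, T = 1.0, 1.5, 1.0, 100.0

# Control: both rewards are close (time 0 vs. time tau).
control_hyp = prefers_later(hyperbolic_value, r, alpha, 0.0, tau)
# Treatment: the same pair shifted into the future (time T vs. time T + tau).
treatment_hyp = prefers_later(hyperbolic_value, r, alpha, T, tau)

control_exp = prefers_later(exponential_value, r, alpha, 0.0, tau)
treatment_exp = prefers_later(exponential_value, r, alpha, T, tau)

print("hyperbolic :", control_hyp, treatment_hyp)
print("exponential:", control_exp, treatment_exp)
```

With these values the hyperbolic discounter picks the sooner reward in control but the later reward in treatment, while the exponential discounter makes the same choice in both variants. A switch between variants is the inconsistency the two diagrams are designed to reveal.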
C Selected Cognitive Biases
Table 5 includes an overview of all cognitive biases included in our dataset and the five cognitive biases we excluded.

D Prompts
Our framework uses standardized prompts to obtain answers from the LLMs. For generating test cases, we use the following GEN prompt to sample insertions for the template gaps:

    You will be given a scenario and a template.
    The template has gaps indicated by double square brackets containing instructions on how to fill them, e.g., [[write a sentence]].

    --- SCENARIO ---
    {{scenario}}

    --- TEMPLATE ---


    You will be given a decision-making task with multiple answer options.

    {{test_case}}

    Select exactly one option.

Secondly, we provide the LLM’s previous answer together with a list of all the available options (but not the entire template instance) to another instance of the same LLM and instruct it to extract only the selected option:

    You will be given answer options from a decision-making task and a written answer.

    --- OPTIONS ---
    {{options}}

    --- ANSWER ---
    {{answer}}

    --- INSTRUCTION ---
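The two-step answer procedure described in this section (a free-form decision, followed by extraction of the selected option by a second instance of the same LLM) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `ask_llm` is a hypothetical callable standing in for any chat API, the prompt constants contain only the wording quoted in this section (with plain hyphens for the section markers and Python format fields in place of the double-brace placeholders), the INSTRUCTION text is supplied by the caller because it is not reproduced here, and the "Option N" parsing rule is an assumption.

```python
import re
from typing import Callable, Optional, Sequence

DECISION_PROMPT = (
    "You will be given a decision-making task with multiple answer options.\n\n"
    "{test_case}\n\n"
    "Select exactly one option."
)

EXTRACTION_PROMPT = (
    "You will be given answer options from a decision-making task "
    "and a written answer.\n\n"
    "--- OPTIONS ---\n{options}\n\n"
    "--- ANSWER ---\n{answer}\n\n"
    "--- INSTRUCTION ---\n{instruction}"
)


def decide(test_case: str,
           options: Sequence[str],
           instruction: str,
           ask_llm: Callable[[str], str]) -> Optional[int]:
    """Return the 1-based index of the selected option, or None on failure."""
    # Step 1: obtain a free-form answer to the test case.
    answer = ask_llm(DECISION_PROMPT.format(test_case=test_case))
    # Step 2: a second call sees only the options and the written answer.
    extracted = ask_llm(EXTRACTION_PROMPT.format(
        options="\n".join(options), answer=answer, instruction=instruction))
    # Assumed output convention: the reply names the choice as "Option N".
    match = re.search(r"Option\s+(\d+)", extracted)
    if match is None:
        return None
    number = int(match.group(1))
    return number if 1 <= number <= len(options) else None
```

Keeping the extraction call blind to the full test case mirrors the description above, where the second instance receives the options and the previous answer but not the entire template instance.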
model and per bias.
Figure 11 reports the share of 30,000 test cases that resulted in failures during the evaluation procedure, per tested model and bias.
Figure 13 contains the low-dimensional visualization of embeddings of the test cases in our dataset w.r.t. the corresponding average bias scores b across 20 evaluated models.
[Plot: Proportion of Test Cases (y-axis) against Number of Tokens (x-axis); series: Ours, Echterhoff et al. (2024), Tjuatja et al. (2024).]
Figure 10: Total tokens obtained in decisions, per model (left) and per bias (right). Tokenizer: tiktoken.
Figure 11: Share of decision failures, per model (left), per bias (right).
Strange Stories Test (Happé, 1994)
TEMPLATE
Situation:
[[Introduce characters of a naive story]].
[[Introduce the main character’s thought
or action in accordance with the story type]].
[[Write a question in quotation marks the
other character asked to the main character]].
[[Write a reply in quotation marks]].
Prompt:
Is it true what [[the main character replied]]?
Answer options:
Option 1: Yes.
Option 2: No.
Scenario
An everyday common situation featuring
ordinary people under the story type: Joke
Insertions
[[Introduce characters of a naive story]]:
"Tom and Jerry were sitting in a café, enjoying
their morning coffee",
[[Introduce the main character’s thought
or action in accordance with the story type]]:
"Tom was pondering why his coffee always
tasted better when Jerry was around",
[[Write a question in quotation marks the
other character asked to the main character]]:
"Jerry suddenly turned to Tom and asked,
’Do you think it’s because I stir it
with my laughter?’",
[[Write a reply in quotation marks]]:
"Tom chuckled and replied, ’Absolutely,
your laughter adds the perfect sweetness!’",
[[the main character replied]]: "Tom replied".
Table 3: This table shows an example of an adaptation of our framework for creating a different kind of test for LLMs. The test case shows an example of a Strange Stories test, widely used in theory of mind research (van Duijn et al., 2023). Gaps sampled from an LLM are highlighted in [[blue]]. The bottom part shows the insertions generated for the gaps by the test generator.
Test Case: Bandwagon Effect
CONTROL TEMPLATE
Situation:
Suppose you are [[a/an]] [[specialist]] at [[organization]]. [[Describe a possible choice task for the role with details regarding the patient]]. You have two equally reasonable opinions on that.
Opinion A: [[Describe an opinion A. Do not include any notion of order between opinions.]].
Opinion B: [[Describe an opinion B. Do not include any notion of order between opinions.]].
Colleagues [[in your field]] usually express opinion {{majority opinion}} in similar cases, and this opinion is getting stronger.
Prompt:
Which opinion do you choose?
Answer options:
Option 1: Very likely opinion A
Option 2: Likely opinion A
...
Option 7: Very likely opinion B

TREATMENT TEMPLATE
Situation:
Suppose you are [[a/an]] [[specialist]] at [[organization]]. [[Describe a possible choice task for the role with details regarding the patient]]. You have two equally reasonable opinions on that.
Opinion A: [[Describe an opinion B. Do not include any notion of order between opinions.]].
Opinion B: [[Describe an opinion A. Do not include any notion of order between opinions.]].
Colleagues [[in your field]] usually express opinion {{majority opinion}} in similar cases, and this opinion is getting stronger.
Prompt:
Which opinion do you choose?
Answer options:
Option 1: Very likely opinion A
Option 2: Likely opinion A
...
Option 7: Very likely opinion B
Scenario A cardiologist at a hospital deciding whether a patient should undergo immediate
surgery or continue with medication-based treatment for heart disease.
Insertions [[a/an]]: "a", [[specialist]]: "cardiologist", [[organization]]: "a hospital",
[[Describe a possible choice task for the role with details regarding the patient]]:
"You have a patient with severe heart disease who is currently stable on medication.
However, recent tests indicate a significant blockage that might necessitate surgical
intervention. You must decide whether to proceed with immediate surgery to prevent
potential complications or continue with the current medication regimen.",
[[Describe an opinion A. Do not include any notion of order between opinions.]]:
"Continuing with medication-based treatment is adequate for managing the patient’s
condition, given their current stability", [[Describe an opinion B. Do not include any
notion of order between opinions.]]: "Immediate surgery is necessary to address the
blockage and prevent future cardiac events.", [[in your field]]: "in the medical field,
particularly in the field of cardiology", {{majority opinion}}: "A".
Table 4: This table shows an example of an adaptation of our framework for measuring cognitive biases in different domains. The test case measures the Bandwagon Effect in LLMs in the medical domain. Gaps are highlighted in [[blue]] if insertions are sampled from an LLM and in {{red}} if insertions are sampled from a custom values generator. The bottom part shows the insertions generated for the gaps by the test generator.
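The gap-filling step illustrated in this table can be sketched in a few lines. This is a minimal sketch, not the framework's actual code: the function name `fill_template`, the regular expression, and the shortened gap names `[[opinion A]]` and `[[opinion B]]` are hypothetical and used only for illustration; the insertion values are taken from the table. Both bracket styles are handled the same way here, since the only difference described above is where the insertion comes from (an LLM or a custom values generator).

```python
import re
from typing import Mapping

# Matches [[...]] gaps (LLM-sampled) and {{...}} gaps (custom values).
GAP_PATTERN = re.compile(r"\[\[.+?\]\]|\{\{.+?\}\}", re.DOTALL)


def fill_template(template: str, insertions: Mapping[str, str]) -> str:
    """Replace every gap in the template with its insertion."""
    def substitute(match: re.Match) -> str:
        gap = match.group(0)
        if gap not in insertions:
            raise KeyError(f"no insertion provided for gap {gap}")
        return insertions[gap]

    return GAP_PATTERN.sub(substitute, template)


insertions = {
    "[[a/an]]": "a",
    "[[specialist]]": "cardiologist",
    "[[organization]]": "a hospital",
    "[[opinion A]]": "continue medication",      # shortened, hypothetical
    "[[opinion B]]": "operate immediately",      # shortened, hypothetical
    "{{majority opinion}}": "A",
}

control = (
    "Suppose you are [[a/an]] [[specialist]] at [[organization]]. "
    "Opinion A: [[opinion A]]. Opinion B: [[opinion B]]. "
    "Colleagues usually express opinion {{majority opinion}}."
)
# The treatment template swaps which opinion text appears under each label.
treatment = (
    "Suppose you are [[a/an]] [[specialist]] at [[organization]]. "
    "Opinion A: [[opinion B]]. Opinion B: [[opinion A]]. "
    "Colleagues usually express opinion {{majority opinion}}."
)

control_text = fill_template(control, insertions)
treatment_text = fill_template(treatment, insertions)
print(control_text)
print(treatment_text)
```

Because the same insertions fill both templates, the control and treatment instances differ only in the swapped opinion labels, which is what allows the two answers to be compared.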
Rank | Cognitive Bias | Number of Publications | Include/Exclude
Table 5: Overview of cognitive biases considered in this paper. Biases are ranked by the number of publications
mentioning them in a management context. Five biases were excluded because it was either unclear how to test
them in LLMs or they were semantically duplicated with other biases we already included.
Developer | Model | API Used | Version Used | Release Date of Version Used | Number of Parameters | Reference
OpenAI | GPT-4o | OpenAI API | gpt-4o-2024-08-06 | August 6, 2024 | 200B* | –
OpenAI | GPT-4o mini | OpenAI API | gpt-4o-mini-2024-07-18 | July 18, 2024 | 10B* | –
OpenAI | GPT-3.5 Turbo | OpenAI API | gpt-3.5-turbo-0125 | January 25, 2024 | 175B* | –
Meta | Llama 3.1 405B | DeepInfra | meta-llama/Meta-Llama-3.1-405B-Instruct | July 23, 2024 | 405B | (Dubey et al., 2024)
Meta | Llama 3.1 70B | DeepInfra | meta-llama/Meta-Llama-3.1-70B-Instruct | July 23, 2024 | 70B | (Dubey et al., 2024)
Meta | Llama 3.1 8B | DeepInfra | meta-llama/Meta-Llama-3.1-8B-Instruct | July 23, 2024 | 8B | (Dubey et al., 2024)
Meta | Llama 3.2 3B | DeepInfra | meta-llama/Llama-3.2-3B-Instruct | September 25, 2024 | 3B | (Dubey et al., 2024)
Meta | Llama 3.2 1B | DeepInfra | meta-llama/Llama-3.2-1B-Instruct | September 25, 2024 | 1B | (Dubey et al., 2024)
Anthropic | Claude 3 Haiku | Anthropic API | claude-3-haiku-20240307 | March 7, 2024 | 20B* | (Anthropic, 2024)
Google | Gemini 1.5 Pro | Google Generative AI API | models/gemini-1.5-pro | September 24, 2024 | 200B* | (Reid et al., 2024)
Google | Gemini 1.5 Flash | Google Generative AI API | models/gemini-1.5-flash | September 24, 2024 | 30B* | (Reid et al., 2024)
Google | Gemma 2 27B | DeepInfra | google/gemma-2-27b-it | July 27, 2024 | 27B | (Riviere et al., 2024)
Google | Gemma 2 9B | DeepInfra | google/gemma-2-9b-it | July 27, 2024 | 9B | (Riviere et al., 2024)
Mistral AI | Mistral Large | Mistral AI API | mistral-large-2407 | July 24, 2024 | 123B | –
Mistral AI | Mistral Small | Mistral AI API | mistral-small-2409 | September 24, 2024 | 22B | –
Microsoft | WizardLM-2 8x22B | DeepInfra | microsoft/WizardLM-2-8x22B | April 15, 2024 | 176B | –
Microsoft | WizardLM-2 7B | DeepInfra | microsoft/WizardLM-2-7B | April 15, 2024 | 7B | –
Microsoft | Phi-3.5 | Fireworks AI API | accounts/fireworks/models/phi-3-vision-128k-instruct | September 18, 2024 | 4.2B | (Abdin et al., 2024)
Alibaba Cloud | Qwen2.5 72B | DeepInfra | Qwen/Qwen2.5-72B-Instruct | September 18, 2024 | 72B | –
01.AI | Yi-Large | Fireworks AI API | accounts/yi-01-ai/models/yi-large | June 16, 2024 | 34B | (Young et al., 2024)
Table 6: Overview of all evaluated LLMs. Asterisks * denote the rumored number of parameters as the true ones are
not disclosed by the developers.
Bias | Verifiable Instruction | Accuracy
Table 7: List of biases with the corresponding verifiable instructions tested using IFEval.
Figure 12: Visualisation of test embeddings from the dataset using t-SNE. Points are grouped by the test’s bias
type. Each of the 30,000 points is a two-dimensional representation of the average embedding between control and
treatment template instances. Embedding model used: text-embedding-3-large by OpenAI.
Figure 13: Visualisation of test embeddings from the dataset using t-SNE. Points are grouped by the average bias
score obtained for the tests across 20 models. Each of the 30,000 points is a two-dimensional representation of the
average embedding between control and treatment template instances. Embedding model used: text-embedding-3-
large by OpenAI.