Agarwal et al., Diagnostic AI
While Artificial Intelligence (AI) algorithms have achieved performance levels comparable
to human experts on various predictive tasks, human experts can still access valuable
contextual information not yet incorporated into AI predictions. Humans assisted by AI
predictions could therefore outperform both humans alone and AI alone. We conduct an
experiment with professional radiologists that varies the availability of AI assistance
and contextual information to study the effectiveness of human-AI collaboration and
to investigate how to optimize it. Our findings reveal that (i) providing AI predictions
does not uniformly increase diagnostic quality, and (ii) providing contextual information
does increase quality. Radiologists do not fully capitalize on the potential gains
from AI assistance because of large deviations from the benchmark Bayesian model
with correct belief updating. The observed errors in belief updating can be explained
by radiologists' partially underweighting the AI's information relative to their own and
not accounting for the correlation between their own information and AI predictions.
In light of these biases, we design a collaborative system between radiologists and AI.
Our results demonstrate that, unless the documented mistakes can be corrected, the
optimal solution involves assigning cases either to humans or to AI, but rarely to a
human assisted by AI.
“We should stop training radiologists now. It's just completely obvious that within five years,
deep learning is going to do better than radiologists.”
– Geoffrey Hinton (in 2016)
1 Introduction
assistance unambiguously improves prediction and decision quality. Unless the costs in terms
of human effort outweigh these benefits, it is optimal to provide AI assistance to humans.
However, a substantial literature in economics suggests that humans may err when making
probabilistic judgments by deviating from the benchmark model of Bayesian updating with
correct beliefs (see Benjamin et al., 2019, for a review). The optimal approach for combining
human and AI information in the presence of these mistakes is non-trivial.
The experiment employs professional radiologists whom we recruit through teleradiology
companies to diagnose retrospective patient cases. We experimentally manipulate the in-
formation set radiologists have access to when making decisions in a two-by-two factorial
design. In the minimal information environment, we provide only the chest X-ray image to
which we add either AI predictions or contextual information, or both. The AI information
treatment provides probabilities that a patient case is positive for a potential chest pathology
generated using an algorithm trained on over 250,000 X-rays with corresponding disease la-
bels (Irvin et al., 2019). This algorithm was shown to perform comparably to board-certified
radiologists. The contextual information treatment provides clinical history information that
radiologists typically have available but, for privacy reasons, was not available to train the
AI. This information includes the treating doctors’ indications, the patient’s vitals, and the
patient’s labs.
We first estimate the treatment effects of those informational interventions on radiologists’
prediction accuracy and the probability of making a correct decision. We then test alterna-
tive models of biased belief updating to investigate whether radiologists exhibit systematic
deviations from a Bayesian benchmark when incorporating AI predictions. For example, hu-
mans may suffer from automation bias, a tendency to place more weight on machine-provided
predictions than on one’s own information.3 Additionally, humans may treat AI predictions
as independent of their own information (Enke and Zimmermann, 2019). Finally, we evaluate
various forms of human-AI collaboration – in terms of diagnostic performance and costs of
expert time – that decide, as a function of the AI prediction, to use only the AI input, only
the human input, or the human input with access to AI.
Diagnostic radiology is an ideal laboratory for investigating human-AI interaction for var-
ious reasons. Deep learning has made significant advances in radiology, surpassing human
decision-making in many cases, and such algorithms are already clinically deployed (Rajpurkar
and Lungren, 2023). It therefore serves as a useful leading indicator for other professions
3
This terminology is borrowed from the literature that dates to the proliferation of computerized auto-
mated support systems in aviation, which raised concerns about human complacency or automation bias (see
Alberdi et al., 2009, for an overview).
where similar developments will follow. Moreover, radiology is a highly-paid medical spe-
cialty, which means that the potential benefits from productivity enhancements through AI
are large. Radiology is also ideal from a research perspective. Unlike other physicians, radi-
ologists do not have direct contact with patients, allowing us to run a decision experiment
that resembles their normal workflow through a remote interface that we developed. To aid
quantitative analysis, instead of obtaining a free-text report, we collect the radiologist's assessed
probability that a given chest pathology is present and a binary clinical recommendation of
whether to treat or follow up for that pathology. For our experiment, we hired 180 radiolo-
gists through teleradiology companies that serve US hospitals and offer the services of both
US-based and non-US-based radiologists.
Analyzing the quality of our participants’ assessments requires establishing, for each patient
case, a standard or ground truth. An important challenge in our setting is that definitive
diagnostic tests do not exist for most thoracic pathologies and, even when they do exist,
are selectively performed depending on a radiologist’s recommendation. We therefore follow
the machine learning literature (Sheng et al., 2008) and construct a diagnostic standard by
aggregating the assessments of five board-certified radiologists at Mt. Sinai Hospital with at
least ten years of experience and chest radiology as a sub-specialty. We assess the robustness
of all our results by constructing a (leave-one-out) ground truth using the assessments of our
experimental participants and by varying the aggregation method.
There are two important challenges in designing the experiment, which we address using
a combination of experimental designs. First, while teleradiology firms allow us to recruit
radiologists for our experiment, we need to compensate them at the market rate, making data
collection expensive. An across-participant design would therefore be cost-prohibitive except
for extremely large effect sizes. In the first design, we, therefore, randomize each participant
into an order in which they are exposed to the four informational environments but do not
encounter cases repeatedly. Thus, this design allows for within-participant comparisons in
addition to an initial pure across-participant comparison. The second challenge is that, to
estimate a model of belief updating, we need to observe radiologists’ assessments of a given
case both with and without AI assistance. Moreover, the assessment without AI assistance is
required to estimate the Bayesian benchmark, which we can compare to the assessment that
incorporates AI assistance. Our second design addresses this issue by asking each participant
to diagnose each patient case once in each of the four information environments in random
order while ensuring at least a two-week wash-out period between two encounters of a given
case. This wash-out period is intended to eliminate the memory of prior information and
anchoring effects.4 Most of our treatment effect analysis pools data from the different designs,
whereas our model estimates of belief updating use only data in which a radiologist reads the
same case both with and without AI assistance.
We find that AI assistance does not improve humans’ diagnostic quality on average even
though the AI predictions are more accurate than almost two-thirds of the participants in
our experiment. Moreover, the zero average effect cannot be explained by the participants
ignoring these predictions – we observe that radiologists’ reported probabilities move sig-
nificantly towards the AI’s predictions with AI assistance. Instead, the zero effect of AI
assistance is driven by heterogeneous treatment effects – diagnostic quality increases when
the AI is confident (e.g. the predicted probability is close to zero or one) but decreases
when the AI is uncertain. In parallel, AI assistance improves diagnostic quality for patient
cases in which our participants are uncertain, but decreases quality for patient cases in which
our participants are certain. In contrast, providing clinical history does improve diagnostic
quality, suggesting that humans have additional valuable information that has not yet been
incorporated into AI predictions.
An upshot of the results is that information available only to radiologists is useful, but
human experts do not correctly combine their information with AI predictions. Specifically,
the result that AI assistance can reduce predictive performance cannot be rationalized if our
participants are Bayesians with correct beliefs because the AI assistance provides weakly
more information to the decision-maker.
Motivated by these results, we analyze two types of deviations from the benchmark model
with correct updating to link errors in probabilistic judgment to the optimal deployment of
AI assistance.5 The first type of deviation occurs when agents do not put correct weights on
their own information and AI information. We describe this deviation using the approach
introduced in Grether (1980; 1992) (see Benjamin, 2019, for a review) to define biases in
belief updating. We say that an agent exhibits own-information bias if they over-weight
their baseline information when provided with AI assistance and own-information neglect if
they under-weight it. Analogously, an agent exhibits automation bias if they over-weight
the AI information relative to their own and automation neglect if they under-weight it.
The second type of deviation occurs if agents utilize an incorrect joint distribution of their
own information and the AI signal.
4
To ensure that our results do not rely on the wash-out being successful, a third design obtains an
assessment with AI assistance only after assessments without AI assistance have been obtained. However,
this third treatment is subject to order effects. We find no evidence of order effects on diagnostic quality
although there is evidence that familiarity with the interface increases the speed with which participants go
through patient cases.
5
We will remain agnostic about whether the deviations we consider are due to non-Bayesian updating or
can be explained by Bayesian updating with incorrect prior beliefs.
Related Literature
A growing body of literature in computer science has explored the predictive performance
of humans versus machine learning algorithms, with radiology often serving as a key area of
application (Rajpurkar et al., 2018, 2017). Additionally, the study of human-AI collaboration
has become an increasingly important facet of medical AI research (Tschandl et al., 2020;
Reverberi et al., 2022). For comprehensive overviews of these areas, see Rajpurkar et al.
(2022); Hosny et al. (2018); Zhou et al. (2021); Lai et al. (2021). Research on the effectiveness
of human-AI collaboration is evolving, with notable studies in radiology including Rajpurkar
et al. (2020); Kim et al. (2020); Park et al. (2019); Seah et al. (2021); Fogliato et al. (2022). An
active literature studies whether AI assistance benefits radiologists, and which radiologists
benefit the most (Rajpurkar et al., 2020; Seah et al., 2021; Ahn et al., 2022; Sim et al.,
2020; Gaube et al., 2023). In contrast to prior studies, we recruit a large group of high-
skilled experts from teleradiology companies under contracts that allow us to incentivize
our participants. A key differentiating factor of our research is that, unlike previous studies
which mainly concentrated on performance, our work emphasizes behavioral biases and their
impact on human-AI interaction.
An emerging literature in economics also compares human and AI performance. Within
economics, these studies tend to rely on observational approaches, with examples addressing
issues in medicine (Ribers and Ullrich, 2022; Mullainathan and Obermeyer, 2019) and bail
decisions (Kleinberg et al., 2015; Angelova et al., 2022). However, analyses based on obser-
vational data face critical identification challenges, such as the selective labels problem (see
Kleinberg et al., 2017; Mullainathan and Obermeyer, 2019; Rambachan, 2021). A limited
set of studies use quasi-experimental approaches (e.g., Stevenson and Doleac, 2019; Angelova
et al., 2022) or randomized controlled trials (e.g., Imai et al. (2020); Bundorf et al. (2020);
Noy and Zhang (2023); Grimon et al. (2022)) to investigate human use of AI tools, typically
focusing on overall performance or variability in participant response. We add to this litera-
ture by developing an experimental approach that manipulates the information environment,
calculates a Bayesian benchmark, and compares behavior against it to document systematic
biases and demonstrate that these biases lead to a non-trivial delegation problem.6
While several studies in behavioral economics have documented errors in probabilistic judg-
ment and belief formation, they do not focus on the issues surrounding human-AI interaction
(c.f. Tversky and Kahneman, 1974; Benjamin et al., 2019; Enke and Zimmermann, 2019; Con-
lon et al., 2022, for example). Our definitions of own-information and automation bias build
on the framework in Grether (1980). We contribute to this literature in two ways. First,
we develop an approach to estimate the parameters of the model in Grether (1992) in an
environment where the joint distribution of the signals cannot be controlled (or partialled
out) by the researcher.7 This approach is necessary because we cannot modify the signal
within medical images. Second, we link the design of AI information provision to the (bi-
ased) updating rule that humans use. This link shows that the use of AI information by
humans is an important and practical application of the ideas in this literature.
6
Our finding that radiologists exhibit automation neglect is related to those in Dietvorst et al. (2015),
which shows that humans are averse to following algorithmic recommendations as compared to human recom-
mendations. This aversion can be reduced if humans are allowed to modify the algorithm’s recommendation
(Dietvorst et al., 2018).
7
Most applications that we are aware of rely on one of two experimental approaches. In the first approach,
the researcher can partial out either the prior information or the likelihood ratio of the signal provided, for
example in the classic bookbag-and-poker-chip experiments (see Benjamin et al. 2019; Benjamin 2019, for
reviews). In the second approach, the researcher directly provides signals from a known joint distribution
(e.g. Conlon et al., 2022).
Our analysis of optimal human-AI collaboration is related to papers that build delegation
algorithms to predict the types of cases for which human performance exceeds machine
performance (e.g. Mozannar and Sontag, 2020; Raghu et al., 2019; Bansal et al., 2021).
Relative to this work, our analysis uses a decision-theoretic model and specific human biases
to trace their consequences for optimal AI deployment.
Finally, our work also adds to the literature on decision-making in the health care context
(e.g. Abaluck et al., 2016; Currie and MacLeod, 2017; Gruber et al., 2021; Chan et al.,
2022; Chandra and Staiger, 2020). This work aims to understand predictions and payoffs
from observational data on medical decisions, objectives that are achievable under less strin-
gent functional form restrictions in our experimental approach. An important distinguishing
feature is that none of these papers consider the effects of AI predictions on medical decision-
making.
Overview
The rest of the paper is organized in the following way. Section 2 introduces our model of a
decision-maker in a diagnostic setting. Section 3 describes the necessary details of the setting
and our experimental design. Section 4 discusses the treatment effects. Section 5 estimates a
descriptive model of deviations from Bayesian updating. Section 6 shows the gains achievable
under the optimal collaboration between radiologists and AI.
2 Conceptual Model
Our study focuses on classification problems and, specifically, on machine learning algorithms
within the domain of AI tools. These algorithms are designed to predict the appropriate
classification for a given case and may assist a human decision-maker. This decision-maker,
indexed by $r$, must take a binary action $a_{ir} \in \{0, 1\}$ on case $i$ based on a prediction of a
binary state $\omega_i \in \{0, 1\}$. The realized payoff $u_r(a_{ir}, \omega_i)$ from an action depends both on
the state and the action. The expert does not know $\omega_i$ but observes, depending on the
information environment, a subset of two signals that are potentially informative about the
state. The first signal is generated by a prediction algorithm (AI), with realizations $s^A_i \in S^A$.
The second signal is directly obtained by the expert, with realizations $s^E_{ir} \in S^E$. These
signals are of arbitrary dimension. The joint distribution of the signals conditional on the
state is given by $\pi_r(\cdot \mid \omega) \in \Delta(S^A \times S^E)$, with prior probabilities over the state $\pi(\omega)$. We do
not place any restrictions on $\pi_r(\cdot \mid \omega)$. The signals need not be independent conditional on
the state of the world, and the signal distribution may depend on the expert, capturing skill
(Chan et al., 2022). We also allow for cases when one of the signals is more informative than
the other.
We allow the human's posterior belief given the observed signals to deviate from that implied
by the true probability law $\pi_r(\cdot \mid \omega)$. Specifically, let $s_{ir} \subseteq \{s^A_i, s^E_{ir}\}$ be the subset of signal
realizations observed by the human $r$ and $p_r(\omega \mid s_{ir}) \in [0, 1]$ be the human's posterior belief
that the state is $\omega \in \{0, 1\}$ when she observes $s_{ir}$. Suppressing the dependence of signals on
the pair $(i, r)$, the human's action given the signal $s$ is
\[
a^*_r(s; p_r) = 1\left\{ \frac{p_r(\omega = 1 \mid s)}{p_r(\omega = 0 \mid s)} > c_{rel,r} \equiv \frac{c_{FP,r}}{c_{FN,r}} \right\}, \tag{2}
\]
where decisions are based on the human's belief $p_r$ but are evaluated according to the true
law $\pi_r$. Because we allow $p_r$ to differ from $\pi_r$, the odds ratio $\frac{p_r(\omega = 1 \mid s)}{p_r(\omega = 0 \mid s)}$ may differ from
the analogous quantity constructed using $\pi_r$. Thus, the action $a^*_r(s; p_r)$ can deviate from the
optimal action $a^*_r(s; \pi_r)$ given the signal $s = (s^A, s^E)$. Except in knife-edge cases, the expected
payoff $V_r(s; p_r)$ is lower than $V_r(s; \pi_r)$ whenever $a^*_r(s; p_r) \neq a^*_r(s; \pi_r)$.
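For concreteness, here is a minimal Python sketch of the decision rule in equation (2), assuming the belief and the costs of false positives and false negatives are given as plain numbers (the function and variable names are ours, purely for illustration):

```python
def optimal_action(p_state1: float, cost_fp: float, cost_fn: float) -> int:
    """Return the action a*_r in {0, 1} implied by equation (2).

    p_state1 : the decision-maker's belief that omega = 1 given her signals.
    cost_fp, cost_fn : costs of a false positive and a false negative.
    """
    c_rel = cost_fp / cost_fn           # relative cost threshold c_{rel,r}
    odds = p_state1 / (1.0 - p_state1)  # posterior odds of omega = 1 vs omega = 0
    return int(odds > c_rel)

# Example: with c_FP / c_FN = 0.5, any belief above 1/3 triggers a follow-up.
print(optimal_action(0.4, cost_fp=1.0, cost_fn=2.0))  # -> 1
```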
A key objective is to analyze deviations in humans' use of AI signals from the benchmark
model of Bayesian updating with correct beliefs (about the joint distribution of the signals
and the state). Bayes' rule implies that, given the signals $(s^A_i, s^E_{ir})$, the log-odds of $\omega_i = 1$ to
$\omega_i = 0$, the key decision-relevant quantity, is given by
\[
\log \frac{\pi_r(\omega_i = 1 \mid s^A_i, s^E_{ir})}{\pi_r(\omega_i = 0 \mid s^A_i, s^E_{ir})}
= \log \frac{\pi_r(s^A_i \mid \omega_i = 1, s^E_{ir})}{\pi_r(s^A_i \mid \omega_i = 0, s^E_{ir})}
+ \log \frac{\pi_r(\omega_i = 1 \mid s^E_{ir})}{\pi_r(\omega_i = 0 \mid s^E_{ir})}. \tag{3}
\]
The second term on the right-hand side is the posterior log-odds ratio for the two states
$\omega_i = 1$ to $\omega_i = 0$ given that the human's signal is $s^E_{ir}$. The first term is positive if, given the
realization $s^E_{ir}$, the signal $s^A_i$ is more likely if $\omega_i = 1$ than if $\omega_i = 0$. In this case,
the posterior odds shift in favor of the state $\omega_i = 1$. Analogously, the odds shift away from
$\omega_i = 1$ if, given $s^E_{ir}$, the signal $s^A_i$ is more likely when the state is $\omega_i = 0$.
The task of empirically analyzing deviations in participants' beliefs from the benchmark in
equation (3) is challenging because the signals differ across cases $i$ and humans may, due to
differences in skill, have heterogeneous signal distributions $\pi_r(\cdot)$. The ideal dataset would
elicit $p_r(\omega_i = 1 \mid s^A_i, s^E_{ir})$ and $\pi_r(\omega_i = 1 \mid s^E_{ir})$ for the same case and the same human to keep
the signals $s^A_i$, $s^E_{ir}$, and human-specific parameters fixed, and would use the latter to estimate
$\pi_r(\omega_i = 1 \mid s^A_i, s^E_{ir})$ using equation (3).
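To make the benchmark concrete, the following sketch applies equation (3) directly: it combines the posterior without AI with the likelihood ratio of the AI signal to produce the benchmark posterior with AI. The likelihood ratio is treated as a known input here; in the empirical exercise it must be estimated from the data, so the numbers below are purely illustrative.

```python
import math

def bayesian_benchmark(p_no_ai: float, ai_likelihood_ratio: float) -> float:
    """Benchmark posterior P(omega=1 | s_A, s_E) implied by equation (3).

    p_no_ai             : posterior P(omega=1 | s_E) without AI assistance.
    ai_likelihood_ratio : pi(s_A | omega=1, s_E) / pi(s_A | omega=0, s_E),
                          assumed known here (estimated from data in practice).
    """
    no_ai_log_odds = math.log(p_no_ai / (1.0 - p_no_ai))
    with_ai_log_odds = math.log(ai_likelihood_ratio) + no_ai_log_odds
    return 1.0 / (1.0 + math.exp(-with_ai_log_odds))

# A radiologist at 30% who sees an AI signal three times as likely under
# omega = 1 should, under correct updating, end up at roughly 56%.
print(round(bayesian_benchmark(0.30, 3.0), 2))  # -> 0.56
```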
3.1.1 Radiology
Radiologists diagnose the presence of a given pathology at the request of a treating physician.
The information available to a radiologist consists of diagnostic images (e.g. chest X-rays),
any relevant medical history (e.g. laboratory results), and clinical indication notes of the
treating physician.8 The treating physician’s notes are of varying detail levels – they may
provide no clinical information or guidance, request the analysis of a specific pathology, or
only list the patient's primary symptom (see appendix B.2 for examples). Irrespective of
the pathology suspected by the treating physicians, radiologists are expected to report all
pathological findings.
8
Radiologists are rarely in direct contact with the patient or the treating physician, except via the formal
information exchange outlined here.
Because image-based classification is a core task performed by radiologists, a high-paying
profession, it is not surprising that AI tools have made significant inroads in the field in the
last decade. Recent advances in deep learning methods for image recognition have yielded
algorithms that can match or surpass the performance of human radiologists (Obermeyer and
Emanuel, 2016; Langlotz, 2019). As of 2020, 55 companies offered a total of 119 algorithmic
products of which 46 have FDA approval (Tadavarthi et al., 2020). Most products related
to clinical decision-making are marketed as support tools as opposed to autonomous tools,
partly due to regulatory and liability issues (Harvey and Gowda, 2020).
3.1.2 CheXpert
We provide AI assistance using the CheXpert model, which is a deep learning prediction
algorithm for chest X-ray interpretation (Irvin et al., 2019). This model is trained on a
dataset of 224,316 chest radiographs of 65,240 patients labeled for the presence of fourteen
common chest radiographic pathologies.9 The algorithm does not use any other patient
information, such as the clinical history or vitals.10 Nonetheless, a prior version of this
algorithm was shown to match or surpass the performance of board-certified radiologists
from Stanford Hospital on five pathologies (Patel et al., 2019). These study results are also
presented to our participants when introducing the AI tool. Section 4 confirms that the
algorithm outperforms a majority of radiologists in our experiment. We relegate additional
details about the algorithm to appendix B.3. The algorithmic assistance provided to our participants
takes the form of a vector of probabilities for the presence of each pathology.11
9
The term artificial intelligence is typically reserved for a system of different prediction tasks to mimic a
more complex set of behaviors, whereas machine learning is concerned with one specific prediction task. For
a detailed discussion of this distinction see, for instance, Taddy (2018).
10
While large datasets of images are increasingly available (e.g. Kramer et al. 2011, Johnson et al. 2016,
Irvin et al. 2019) it is significantly more difficult to construct such datasets for other patient information due
to the compulsory manual review of textual data for HIPAA compliance.
11
Some algorithms attempt to make their predictions explainable to a human by highlighting the parts
of the image that drive a specific prediction. However, prior studies show that providing such localization
in addition to the numeric output does not improve the accuracy of radiologists (Gaube et al., 2022). A
quantitative output allows us to compute a Bayesian benchmark to the radiologist’s prediction, which is
otherwise difficult.
Our experiment varies the information available to diagnose patient cases—participants may
or may not receive AI assistance and access to the clinical history. The X-ray is shown under
all information conditions. We expose our participants to all four possible information con-
ditions: X-ray only, henceforth XO; clinical history without AI, henceforth CH ; AI without
clinical history, henceforth AI ; and both clinical history and AI, henceforth AI+CH.
There are two objectives of our experiment. The first is to compute the treatment effects of
AI and CH on diagnostic quality and radiologist time. The second is to analyze systematic
deviations from the Bayesian benchmark.
Both these objectives are complicated by the likely heterogeneity in radiologist skills. For
estimating treatment effects, radiologist heterogeneity implies that a design that randomizes
treatments only across radiologists will require a large participant pool except for extremely
large effect sizes. Our participants are highly paid experts, making this approach expensive.
And, as explained in section 2, across-radiologist variation in information treatments is not
tailored for the second objective. We would ideally know how a given radiologist changes her
assessment for the same case under a different information condition.
Our approach to address these challenges is to use a combination of three different experi-
mental designs, each with certain advantages and disadvantages. Appendix B.1 illustrates
the three design variations.
In the first design, participants are assigned to a random sequence of the four information
treatments. Each information condition is assigned fifteen cases at random without repe-
tition. Participants read all 15 cases in one information environment before moving to the
next one.
This design builds in both across- and within-participant variation in information treatments.
The within-participant variation has greater power because it controls for participant het-
erogeneity at the potential cost of order effects. The concern of order effects is both testable
and mitigated by the randomization of treatment sequence across subjects.
This first design is well-suited to estimate treatment effects of our information environments.
However, as mentioned earlier, it is not ideal for estimating an empirical analog to equation
(5) because no case is encountered twice.
Radiologists diagnose each patient case in each of the four information environments in the
second design. For the moment, set aside concerns arising from the feature that the same
radiologist encounters the same case multiple times. This design will allow us to estimate
an empirical analog to equation (5). It also has the added benefit of controlling for
case-radiologist heterogeneity because, unlike in the previous design, we can conduct within-
case-radiologist comparisons across treatments.
Because radiologists repeatedly encounter cases, we need to address the potential for order
effects due to memory. For example, radiologists might anchor on their previous assessment
using AI predictions or contextual information and might remember this information the
next time the same case is encountered. We, therefore, limit radiologists’ ability to remem-
ber either their diagnosis or previously provided information by using a “washout” interval
between two encounters of the same case.12 Specifically, radiologists complete the experi-
ment in four sessions that are separated by at least two weeks. Each session is similar to the
first design: radiologists diagnose fifteen cases in each of the four information environments
with no case repeated within a session. Across sessions, the information environment under
which a given case is diagnosed is permuted. Thus, by the end of the fourth session, each of
the sixty cases is diagnosed exactly once in each information environment. Our results are
consistent with the washout being effective – radiologists’ predictions do not move towards
the AI prediction if it was provided in a prior session but do if it is provided in the current
session (see figure C.28).
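For illustration, the sketch below generates one schedule with these properties: every case is read exactly once in each of the four information environments, and every session contains fifteen cases per environment. The rotation used here is a simple Latin-square assignment under our own assumptions; the experiment's actual randomization procedure may differ.

```python
import random

ENVIRONMENTS = ["XO", "CH", "AI", "AI+CH"]

def design2_schedule(case_ids, seed=0):
    """Map (session, case) -> information environment such that every case is
    read once in every environment and each session has 15 cases per
    environment (assumes exactly 60 cases and 4 sessions)."""
    rng = random.Random(seed)
    cases = list(case_ids)
    rng.shuffle(cases)
    schedule = {}
    for block, case in enumerate(cases):
        start = block % 4  # environment the case receives in session 1
        for session in range(4):
            schedule[(session + 1, case)] = ENVIRONMENTS[(start + session) % 4]
    return schedule

sched = design2_schedule(range(60))
# In session 1, exactly fifteen cases fall into the AI environment.
print(sum(1 for (s, _), env in sched.items() if s == 1 and env == "AI"))  # -> 15
```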
In the third design, we address residual concerns about the order effects of radiologists
diagnosing cases with AI before those without AI–whether due to anchoring, memory, or
experimenter demand–by having participants diagnose fifty cases, first without and then
with AI assistance. Within each block, clinical history is randomly provided in either the
first or second half of images.
This design also allows us to conduct within case-radiologist comparisons. The potential
disadvantage of this design is that we cannot distinguish order effects from the effect of
providing AI. This issue is unavoidable given the guiding principle that participants receive
weakly more information about a case during a repeat encounter. However, we can test for
12
This principle has been used in computer science (Seah et al., 2021; Conant et al., 2019; Pacilè et al.,
2020).
and do rule out order effects on accuracy based on the first two designs.
Participants for the first and third designs, which constitute the majority, were recruited
through teleradiology companies. The teleradiology companies allow us to recruit several
experts in a relatively liquid spot market, a practice that is now common for decision exper-
iments with non-expert subjects (Hunt and Scheetz, 2019). Most healthcare providers in the
US rely on these companies’ services, although many large hospitals have on-call radiologists
(Rosenkrantz et al., 2019). We work with teleradiology companies that serve US hospitals
and offer the services of both US-based and non-US-based radiologists. Our contracts with
teleradiology companies specify a piece-rate, and the companies, in turn, compensate the
participants with a piece-rate.13 In addition, we provided monetary incentives for accuracy
to a subset of radiologists, as described in the next section.
The second design required us to work with a partner who could guarantee subjects’ partici-
pation over several months. We collaborated with VinMac healthcare system in Vietnam to
recruit their staff radiologists to ensure continued participation. VinMac is in the process of
developing its own in-house AI capabilities and was willing to assist with our experiment in
exchange for recognition in a publication of the resulting dataset. The VinMac radiologists
did not receive any payments to participate in the experiment, but we find that their
performance is very close to the performance of the teleradiologists.
In total, 180 radiologists participated in our experiment. Close to 17% of our participants
are US-based, 38% have a degree from a US institution, 80% are affiliated with a large clinic,
and 61% with an academic institution. As demonstrated in appendix C.3, the quality of
the assessments made by the radiologists in our study is comparable to that of the staff
radiologists from Stanford University Hospital, who originally diagnosed the patient case.
3.2.5 Incentives
We cross-randomize incentives for accuracy in the first and third designs but not the second
because of the specific ways in which our partner’s radiologists are employed.14 Payments
were determined following the binarized scoring rule in Hossain and Okui (2013), where truth
13
We worked with three companies with piece-rates ranging from $7.50 to $13.00.
14
There does not appear to be a consensus in the experimental literature on whether incentives are superior
to non-incentivized responses when eliciting beliefs (see Danz, Vesterlund, and Wilson, 2020; Charness,
Gneezy, and Rasocha, 2021). A comparison between the incentivized and non-incentivized participants
allows us to answer this question in the context of this experiment.
is determined as described in section 3.3.1 below. This incentive scheme uses a loss function
equal to the mean squared prediction error, averaging over patient cases and pathologies, and
respondents earn a fixed bonus of $120 if this loss falls below a random draw. This bonus is
more than 20% of the base payment to teleradiology firms. Using a non-mathematical
description of the payment rule, we explain to the participants that expected payments are
maximized if they provide their best estimates. We specify the distribution of the random
draw so that 30% of pilot participants would earn the bonus. The accuracy incentive is
cross-randomized with the other two treatment arms.
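A minimal sketch of this payment rule under our reading of the binarized scoring rule: the mean squared error of the probability reports is compared to a uniform random draw, so that lower error implies a higher chance of receiving the $120 bonus. The upper bound of the draw, `k_max`, is a placeholder for the calibrated constant, which is not reported here.

```python
import random

def bsr_bonus(reported, truth, k_max, bonus=120.0, rng=random.random):
    """Binarized scoring rule payment (sketch).

    reported, truth : probability reports and 0/1 ground truths, pooled over
                      the radiologist's case-pathologies.
    k_max           : upper bound of the uniform draw (placeholder for the
                      constant calibrated in the pilot).
    Pays the bonus when the mean squared error falls below the random draw,
    so more accurate reports earn the bonus more often.
    """
    mse = sum((p - y) ** 2 for p, y in zip(reported, truth)) / len(reported)
    return bonus if mse < rng() * k_max else 0.0

print(bsr_bonus([0.9, 0.2, 0.7], [1, 0, 1], k_max=0.5))
```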
The experiment uses 324 historical patient cases with potential thoracic pathologies from
Stanford University’s healthcare system. For each case, we have access to the chest X-ray
and the clinical history in the form of the primary provider’s written notes, the patient vitals,
and demographics.15 The use of retrospective cases allows us to avoid ethical and other issues
that would arise when experimenting in high-stakes settings.
Our analysis requires constructing ωi for each patient case. There are important challenges
in using an observational dataset of patient health records to construct this field. One ap-
proach would be to use the results from further medical tests. Unfortunately, definitive
gold-standard tests do not exist for most thoracic pathologies.16 Even when follow-up tests
are conducted, they are selected on the likely presence of a pathology, an issue referred to
as the selective labels problem (e.g. Mullainathan and Obermeyer, 2019). Medical outcomes
from patient health records also do not suffice because actions taken by the treating physi-
cian in response to the radiology report contaminate these measures. Recent literature has
suggested instrumental variables approaches for solving this selective labels problem, but this
work targets population quantities and not a “ground truth” on each case (e.g. Chan et al.,
2022; Mullainathan and Obermeyer, 2019).
We construct ωi by aggregating the assessment of a group of expert radiologists, an approach
common in computer science (Sheng et al., 2008; Mccluskey et al., 2021). Specifically, we
15
All cases are first encounters with no prior X-ray as a comparison. We started with 500 cases that fit these
primary criteria. We omitted pediatric cases from this set. Finally, a radiologist reviewed the cases to remove
instances with poor image quality. The clinical history was manually reviewed to remove patient-identifiable
information and cleared for public release.
16
Many pathologies do not have commonly used non-imaging-based diagnostic tools. For instance, the
presence of cardiomegaly – an enlarged heart – can only be determined using imaging tools, thoracic surgery,
or an autopsy.
ask five board-certified radiologists from Mount Sinai to read each of the 324 cases using
the interface described above with the available X-ray and clinical history. For each case
$i$ and expert radiologist $r$, we therefore obtain $\pi_r(\omega_i = 1 \mid s^E_{i,r})$ for each pathology, which
we aggregate to generate $\omega_i$. Specifically, we classify $\omega_i = 1$ if $\sum_r \pi_r(\omega_i = 1 \mid s^E_{i,r})/5 > 0.5$.
This approach immediately addresses the selective labels problem because the availability of
assessments is not selected on the likelihood of a pathology being present. Results in Wallsten
and Diederich (2001) suggest that, under weak conditions that allow for measurement error in
the reports and correlations across reports, the aggregate opinion of several experts is highly
diagnostic as long as the experts are median unbiased. To assess robustness, we aggregate
$\pi_r(\omega_i = 1 \mid s^E_{i,r})$ using both a log-odds average and a leave-one-out mean of the assessments
of all the radiologists in our experiment in the clinical history treatment condition.
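A sketch of this aggregation step, assuming a table with one row per case-pathology and one column per expert; the column layout and function names are ours, and the log-odds average is included only to illustrate the robustness variant.

```python
import numpy as np
import pandas as pd

def aggregate_ground_truth(expert_probs: pd.DataFrame) -> pd.Series:
    """Binary ground truth: omega_i = 1 if the mean of the five experts'
    probabilities exceeds 0.5 (rows = case-pathologies, columns = experts)."""
    return (expert_probs.mean(axis=1) > 0.5).astype(int)

def log_odds_average(expert_probs: pd.DataFrame, eps: float = 1e-3) -> pd.Series:
    """Robustness aggregate: average the experts' log-odds and map the result
    back to a probability (reports are clipped away from 0 and 1)."""
    p = expert_probs.clip(eps, 1 - eps)
    mean_log_odds = np.log(p / (1 - p)).mean(axis=1)
    return 1 / (1 + np.exp(-mean_log_odds))

# Placeholder reports: 324 case-pathologies read by 5 experts.
experts = pd.DataFrame(np.random.uniform(size=(324, 5)))
omega = aggregate_ground_truth(experts)
```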
We developed the interface to present the patient cases and to collect radiologists' predictions
and decisions in collaboration with board-certified radiologists at Stanford University Hospital
and Mt. Sinai Hospital. In contrast to free-text reporting, the interface is designed to generate
structured and quantitative data while otherwise resembling a typical radiological report.
A guiding principle in the design was to mimic clinical practice and to present and obtain
all clinically relevant information. We briefly describe this interface and provide images and
further details in appendix B.
On the landing page of each case, a high-resolution image of a patient’s X-ray is presented to
the radiologist, with the functionality to zoom and adjust brightness and contrast. When the
experiment calls to show the clinical history, the interface presents clinical notes, vitals, and
laboratory results available at the time the X-ray was originally ordered. If the experiment
provides AI assistance, participants are shown AI predictions for fourteen pathologies and a
composite prediction for whether or not there are any relevant findings.
The data entry interface collects a radiologist’s assessments of the probability that various
thoracic pathologies are present for a given case. The probability that a pathology is present
given the available information, i.e. $p_r(\omega = 1 \mid s)$, is elicited using a continuous slider. We
visually subdivide possible responses into five intervals with standard language labels used
in written radiological reports to aid the participants.17 Our radiology collaborators grouped
pathologies into eight exclusive parent categories based on their type. Each group has children
17
The specific labels are “Not present”, “Very Likely”, “Unlikely”, “Possible”, “Likely”, and “Highly Likely”.
Several radiological publications have suggested such standardized language for radiological reports. See for
instance Panicek and Hricak (2016).
that are more specific, which may be further subdivided in some cases. The groups all
correspond to a standard class of pathologies. For instance, “airspace opacity” is distinct
from a “cardiomediastinal abnormality.” In the main text we focus on the parent categories
with AI predictions and drop further subdivisions from the analysis. We refer to those as
top-level pathologies. Our results are robust to including the lower-level pathologies in the
analysis, as we show in appendix C.4. The categorization of thoracic pathologies into groups
by type also eases data entry, for example by allowing the user to simultaneously set the
assessed probability of each disease in a specific category to zero.18 In addition, we elicited
an overall bottom-line assessment of whether the radiologist considers the case normal or
not.
We also ask for a binary “treatment/follow-up” recommendation for each pathology that
is not definitively ruled out.19 We will interpret this input as $a^*_r(s)$. In a real clinical
setting, a recommendation to follow up could trigger the treating physician to prescribe
additional medical tests or interventions with potential costs and benefits. Thus, an optimal
recommendation trades off the costs of false positives and false negatives when recommending
an action, as in section 2. The probabilistic assessments together with the follow-up decisions
will allow us to estimate radiologists' relative cost of false positives and false negatives.
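As an illustration of how the two elicited objects identify the cost ratio, the sketch below picks, for each radiologist, the probability cutoff that best rationalizes her binary recommendations and converts it into an implied $c_{rel,r} = c_{FP,r}/c_{FN,r}$. This grid-search estimator is our own simplification, not necessarily the estimation approach used in the paper.

```python
import numpy as np

def implied_cost_ratio(probs, decisions, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the probability cutoff t that best matches the observed binary
    recommendations (decision = 1 iff prob > t), then return the implied
    relative cost c_rel = t / (1 - t). Illustrative only."""
    probs, decisions = np.asarray(probs), np.asarray(decisions)
    accuracy = [np.mean((probs > t).astype(int) == decisions) for t in grid]
    t_star = grid[int(np.argmax(accuracy))]
    return t_star / (1 - t_star)

# For one radiologist, pooling her case-pathologies:
print(implied_cost_ratio([0.10, 0.30, 0.55, 0.80], [0, 0, 1, 1]))
```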
In addition to $p_r(\omega = 1 \mid s)$ and $a^*_r(s)$, we record active time, response times, and any click-
stream data that results from the interaction with the interface. The participants are not
explicitly informed about this monitoring, and there are no explicit time limits. Our ex-
periment runs remotely, and participants connect to a server, which hosts the interface and
records responses. The interface has been extensively tested beforehand on different browsers
by conducting pilots with every company we recruited the participants from.20
18
Our radiology collaborators grouped pathologies into eight exclusive parent categories based on their
type. Each group has children that are more specific, which may be further subdivided in some cases. The
groups all correspond to a standard class of pathologies. For instance, an “airspace opacity” is distinct from a
“cardiomediastinal abnormality.” The more specific disease of “bacterial pneumonia” manifests as an airspace
opacity. This structure reduced the burden on our participants, and we piloted the interface with several
radiologists specializing in the interpretation of chest X-rays. Prior clinical research on AI in chest X-Ray
image classification has used similar hierarchies (see Seah et al., 2021, for example).
19
The binary treatment/follow-up decision is only asked for pathologies where a follow-up is clinically
relevant. This includes all pathologies with AI assistance.
20
This interface is browser-based and built using the oTree framework (Chen et al., 2016). Since we
are not directly communicating with our participants, we also deploy a device fingerprinting service from
fingerprint.com to ensure that there are no repeat participants.
We train the participants using a combination of written instructions and a video. The
materials provide an overview of the experimental tasks, the interface, and information about
the AI assistance tool. The firms and the participants know that the research study involves
retrospective patient cases. To train participants on the AI tool, we provide them with
materials that explain the development of the algorithm, present metrics of its performance
on various diseases, and summarize the algorithm’s performance relative to radiologists based
on prior research. The participants are informed that the algorithm only uses the chest X-
ray to form predictions, and this knowledge is later tested in a comprehension question. In
addition, we show the participants fifty example cases displaying the X-ray and clinical history
next to the AI output. After the instructions, participants must answer eight comprehension
questions correctly before proceeding to the experiment. We also
include an endline survey. We do not directly interact with the subjects except to field
questions about the experiment or provide tech support.21 The complete set of instructions
is provided in appendix B.2.
This section’s analysis focuses on measures of performance (deviation from ground truth,
incorrect decision), deviation from AI prediction, and measures of effort. Table 1 summarizes
the data on these measures and sample sizes from our experiment. The main text focuses
on top-level pathologies with AI with robustness to other pathology groups relegated to
appendix C.4.22
Radiologists make the correct follow-up/treatment recommendation on approximately 70%
of case-pathologies on average and spend ~2.8 minutes per case with large variability across
cases. All summary statistics are very similar across the three experimental designs (tables
C.7 and C.9). For instance, the average deviation from the ground truth for the three designs
ranges from 0.191 to 0.232, and average active time ranges from 2.58 to 3.03 minutes.
21
We are blinded to the participants’ treatment status while the experiment is in progress.
22
These pathology groups and, unless otherwise noted, the subsequent analyses were pre-registered (see
SSR Registration 9620).
Table 1: Summary statistics of the experimental data
Note: This table presents summary statistics of the experimental data. Columns (1) and (2) present the mean and standard
deviation for all designs while Columns (3) and (4) present the same for design 1 only. Decision is an indicator for whether
treatment/follow-up is recommended, correct decision is an indicator for whether the decision matches the ground truth, deviation
from ground truth is the absolute difference between the reported probability and the ground truth, deviation from AI is the
absolute difference between the expert’s reported probability and the AI’s reported probability, active time is measured in
minutes.
In a comparison study with three radiologists, Irvin et al. (2019) show that the CheXpert
model yields a better classifier than two out of three radiologists on five pathologies and
all three on a subset of three pathologies. Our results may differ from theirs because we
use a different pool of radiologists, a different sample of cases, and reads with contextual
information (clinical history) to construct the ground truth. The latter two differences raise
the bar for the AI because they reflect differences in the data-generating process.
To benchmark the quality of AI predictions in our sample and the radiologists in our exper-
iment, we compare our participant pool with the AI input using two performance measures.
The first measure is derived from the receiver operating characteristic (ROC) curve, which
measures the trade-off between the false positive and the true positive rate of a classifier as
the cutoff for classifying a case as positive or negative is varied. It uses only ordinal informa-
tion about the AI. A classifier that is not better than random has an AUROC value of 0.5
whereas a perfectly predictive classifier has a value of one. The second measure is the root
mean squared error (RMSE), which utilizes cardinal information about the AI prediction. A
lower RMSE indicates higher performance.23 We pool the data for top-level pathologies with
AI for each radiologist’s reports and for the AI’s prediction.
Figure 1 shows significant heterogeneity in performance across radiologists as well as the scope
for AI assistance to improve radiologist performance. The AUROC shows that the bottom
tail of the distribution of radiologists performs worse than the AI, whereas the upper tail
predicts close to perfectly. This heterogeneity aligns with findings from observational data
based on only treatment decisions (see Chan et al., 2022). The AI is more predictive than
approximately two-thirds of radiologists, whether compared using AUROC or RMSE. Thus,
there is ample room for AI assistance to improve the performance of radiologists. In fact, a
majority of radiologists would do better on average by simply following the AI prediction.
We present pathology-specific comparison results in appendix C.2.
[Figure 1: Distributions of radiologists' RMSE (left) and AUROC (right), with the AI's accuracy marked.]
Note: These histograms show distributions of two different accuracy measures of radiologist assessments alongside the AI's
accuracy. The left graph shows the distribution of the RMSE while the right shows the distribution of the average AUROC.
Both distributions are shrunk to the grand mean using empirical Bayes. These measures are for each radiologist and include
the top-level pathologies. The dotted line is the average measure of the AI algorithm for the corresponding distribution. Only
assessments where contextual history information is available to the radiologists but the AI prediction is not are considered.
Robustness by design and ground truth definition can be found in sections C.4.1 & C.4.2.
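A sketch of how the two measures can be computed per radiologist from a long-format table of reports, assuming placeholder column names and using scikit-learn for the AUROC:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def performance_by_radiologist(df: pd.DataFrame) -> pd.DataFrame:
    """df columns (placeholders): radiologist, prob (reported probability),
    truth (binary ground truth). Returns AUROC and RMSE per radiologist.
    (Assumes each radiologist has both positive and negative case-pathologies.)"""
    rows = []
    for rad, g in df.groupby("radiologist"):
        rows.append({
            "radiologist": rad,
            # AUROC uses only the ordinal ranking of the reports.
            "auroc": roc_auc_score(g["truth"], g["prob"]),
            # RMSE also uses the cardinal scale of the reports.
            "rmse": np.sqrt(np.mean((g["prob"] - g["truth"]) ** 2)),
        })
    return pd.DataFrame(rows)
```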
We also compare the performance of our participants and the radiologists who originally
diagnosed each patient case in appendix C.3.24 There is no discernible difference between
23
An important distinction between the two measures is that the AUROC is ordinal whereas RMSE is
cardinal.
24
We classified the original free text radiology reports associated with each case as positive, negative, or
uncertain for each pathology using the CheXbert algorithm described in Smit et al. (2020). To facilitate
comparisons, we also discretized the probability assessments elicited during the experiment into these three
categories. Then, we compared the accuracy of the original reads against that of the radiologists participating in
the experiment.
the two groups, which is consistent with the hypothesis that radiologists participating in
the study were of similar skill and exerted similar effort as the radiologists completing the
original reads.
We now describe the effects of our information treatments estimated using the following
specification:
\[
Y_{irt} = \gamma_{h_i} + \gamma_{CH} \cdot d_{CH}(t) + \gamma_{AI} \cdot d_{AI}(t) + \gamma_{AI \times CH} \cdot d_{CH}(t) \cdot d_{AI}(t) + \varepsilon_{irt},
\]
where $Y_{irt}$ is an outcome variable of interest for radiologist $r$ diagnosing patient case-pathology
$i$ in treatment $t$, and $\gamma_{h_i}$ are pathology fixed effects since there are multiple pathologies $h_i$
for each case in this pooled analysis. Treatments $t$ vary by whether or not clinical history is
provided, $d_{CH}(t) \in \{0, 1\}$, and whether or not AI information is provided, $d_{AI}(t) \in \{0, 1\}$. We
report two-way clustered standard errors at the radiologist and patient-case level. The spec-
ification omits radiologist-specific fixed effects because the treatments are balanced within
radiologists. Cases are also balanced across treatments (see appendix C.1), which suggests
that case randomization was successful. We will also compute conditional treatment effects
given ranges of the AI signal $s^A_i$ that are grouped based on $\pi(\omega_i = 1 \mid s^A_i)$.
Table 2: Average treatment effects
Note: This table summarizes the average treatment effects (ATE) of different information environments on the absolute value of the difference between the radiologist
probability and AI probability (column (1) and (2)); absolute value of the difference between the radiologist probability and the ground truth (columns (3) and (4)); and
radiologists’ effort measured in terms of active time and clicks (columns (5), (6), (7) and (8)). We either pool across all designs (All Designs) or condition on only design 1.
Results on the effort measures exclude five patient cases with unaccounted time, and observations from design 3 because of learning effects in that set-up. The results
are for the two top-level pathologies, airspace opacity and cardiomediastinal abnormality. Standard errors, two-way clustered at the radiologist and patient-case level, are reported in
parentheses. Robustness by design can be found in section C.4.1.
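A sketch of the estimating equation above using statsmodels, with placeholder column names. For simplicity the standard errors here are clustered by radiologist only, whereas the paper reports two-way clustered standard errors at the radiologist and patient-case level:

```python
import statsmodels.formula.api as smf

def estimate_ate(df):
    """df columns (placeholders): y (outcome), d_ai, d_ch (0/1 treatment
    indicators), pathology, radiologist. Pathology fixed effects enter via
    C(pathology); d_ai:d_ch is the interaction gamma_{AI x CH}."""
    model = smf.ols("y ~ d_ai + d_ch + d_ai:d_ch + C(pathology)", data=df)
    # One-way clustering by radiologist as a simplification of the paper's
    # two-way clustering by radiologist and patient case.
    return model.fit(cov_type="cluster", cov_kwds={"groups": df["radiologist"]})

# res = estimate_ate(df); print(res.params[["d_ai", "d_ch", "d_ai:d_ch"]])
```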
The above analysis focuses on the effects on the marginal distributions of the outcome vari-
ables $Y_{irt}$ for each pathology. Thus, the specification abstracts away from interactions between
pathologies in the effects of information provision, for example due to potential dependence
between the predictions and decisions. In section 5, we will present evidence showing that
the best-fitting model has radiologists updating their beliefs as if they do not account for
dependence between pathologies.
The treatment effect analysis in the main text pools the three experimental designs and
does not condition on the sequence in which subjects encounter information treatments.
Appendix C.4.5 shows that our results are robust to including controls for order effects.
The estimates from all three designs are similar to each other. They are also statistically
indistinguishable from those that use only an across participant comparison from the first
treatment encountered in design 1 (appendix C.4.3).
We begin by testing whether radiologists respond to the information that the AI pro-
vides. The left panel in table 2 shows how the different information environments affect
the disagreement of the radiologists' report with the AI's assessment, which is defined
as $Y_{irt} = \left| p_r(\omega_i = 1 \mid s_{irt}) - \pi(\omega_i = 1 \mid s^A_i) \right|$. The term $p_r(\omega_i = 1 \mid s_{irt})$ is elicited, whereas
$\pi(\omega_i = 1 \mid s^A_i)$ is the AI's predicted probability that $\omega_i = 1$. When $t$ indicates that AI as-
sistance is provided, then $s_{irt} = (s^E_{ir}, s^A_i)$, and when $t$ indicates that AI assistance is not
provided, then $s_{irt} = s^E_{ir}$. The signal $s^E_{ir}$ also depends on whether contextual information is
provided.
The results show that radiologists utilize AI assistance: their predictions are closer to the
AI predictions when provided with assistance. To see this, observe that the control means
for the deviation from the AI are approximately 0.21, both when we pool designs and for
design 1 only. Treatments where AI is provided reduce this baseline average deviation by
18% (all designs) and 16% (design 1) (p < 0.01).
We next ask whether the information treatments affect radiologists’ diagnostic performance.
As a measure, we consider the absolute deviation of the radiologist’s probability report from
the binary ground truth, $Y_{irt} = |p_r(\omega_i = 1 \mid s_{irt}) - \omega_i|$, where lower values imply better per-
formance. Appendix C.4.2 contains results that use a continuous ground truth, which are
qualitatively similar. The middle panel of table 2 shows the average treatment effects on
performance.
Our results indicate that while access to contextual information improves performance on
average, AI assistance does not. We find that access to clinical history reduces the deviation
from the ground truth by 4.8% (p < 0.05) of the control mean if we pool designs and by 6.7%
in design 1 (p < 0.05). In contrast, the effects for AI are close to zero and not statistically
significant. In light of the findings in the previous two sections, namely that the AI is more
accurate than most radiologists and that radiologists move their assessments toward the
AI, it is puzzling that the AI information does not improve accuracy.
This contradiction occurs because the average treatment effects mask significant hetero-
geneity in treatment effects. Our within-participant designs (designs 2 and 3) allow us to
estimate conditional treatment effects given radiologists' predictions without AI assistance.
Specifically, we partition cases based on the expert's signal into five equally spaced bins of
$\pi_r(\omega_i = 1 \mid s^E_{ir})$. Figure 2 shows the conditional treatment effects of providing AI assistance
on diagnostic performance. Panel (a) shows the deviation from the ground truth and panel
(b) shows the probability of an incorrect decision. We find that providing AI assistance in cases
where the radiologist is uncertain improves performance on both metrics, whereas AI assis-
tance is harmful when the radiologist is close to certain that the pathology is not present
for a given case. Because the vast majority of cases have an average reported probability of
25% or less and therefore fall in the first bin, the average treatment effect is small and masks
important heterogeneity. The result that AI assistance can decrease performance rejects a
model in which radiologists are Bayesians with correct beliefs.
[Figure 2: Conditional treatment effects of AI assistance on (a) deviation from the ground truth and (b) incorrect diagnosis, by bins of the radiologist's probability without AI.]
Note: Panel (a) shows the conditional average treatment effect of providing AI information on the absolute value of difference
between the radiologist probability and the ground truth. Panel (b) shows analogous treatment effects on incorrect diagnosis,
where a correct diagnosis is defined as the treatment recommendation matching the ground truth. Both these treatment effects
are conditional on ranges of the expert’s prediction elicited without AI assistance. Standard errors are two-way
clustered at the radiologist and patient-case level. The error bars depict 95% confidence intervals. Robustness to experimental
design is in appendix C.4.1.
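A sketch of the conditional treatment effect calculation behind figure 2, assuming a within-participant panel in which each radiologist-case-pathology pair is observed both with and without AI assistance; the column names and binning helper are placeholders:

```python
import pandas as pd

def conditional_ate_by_no_ai_bin(df: pd.DataFrame) -> pd.Series:
    """df columns (placeholders): radiologist, case, pathology, has_ai (0/1),
    prob (reported probability), dev_truth (|prob - ground truth|).
    Returns the mean within-pair change in dev_truth from adding AI, by bins
    of the no-AI report."""
    keys = ["radiologist", "case", "pathology"]
    wide = df.pivot_table(index=keys, columns="has_ai",
                          values=["prob", "dev_truth"])
    bins = pd.cut(wide[("prob", 0)], [0, 0.2, 0.4, 0.6, 0.8, 1.0],
                  include_lowest=True)
    effect = wide[("dev_truth", 1)] - wide[("dev_truth", 0)]
    return effect.groupby(bins).mean()
```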
While AI assistance can help uncertain experts, we find that providing uncertain AI predic-
tions reduces performance. As with the analysis of conditional treatment effects given expert
predictions, we estimate conditional treatment effects given AI predictions by partitioning
cases into five bins based on $\pi(\omega_i = 1 \mid s^A_i)$. Figure 3 presents the estimates, pooling data
from all three experimental designs. When the AI provides a confident prediction (e.g. either
close to zero or close to one), performance is significantly improved. We see that in the lowest
bins of AI signals, the deviation from the ground truth is reduced. In the second-highest
bin we also see a marked, though not statistically significant, improvement in performance.
However, in the middle range of signals, where the confidence of the AI is low, radiologists'
diagnostic performance and probability of making a correct decision are lower when AI infor-
mation is provided. This result reinforces the conclusion that radiologists err when updating
their beliefs.
[Figure 3: Conditional treatment effects of AI assistance on (a) deviation from the ground truth and (b) incorrect diagnosis, by bins of the AI prediction.]
Note: Panel (a) shows the conditional average treatment effect of providing AI information on the absolute value of difference
between the radiologist probability and the ground truth. Panel (b) shows analogous treatment effects on incorrect diagnosis,
where a correct diagnosis is defined as the treatment recommendation matching the ground truth. Both these treatment effects
are conditional on the ranges of AI prediction. Standard errors are two-way clustered at the radiologist and patient-case level.
The error bars depict 95% confidence intervals. Robustness to experimental design is in appendix C.4.1 and C.4.2.
Finally, we turn our attention to the effects of AI assistance on time taken and the number
of unique interactions (clicks) as a proxy for effort. One hypothesis is that AI assistance could
economize on costly human effort without sacrificing overall performance by enabling quicker
assessments. At the polar opposite, it is possible that humans take more time because they
are provided with more information to process. Which of these effects dominate determines
the effect on labor costs when humans use AI assistance, and therefore the optimality of
delegating cases versus a collaborative setup.
Our results indicate that radiologists are slower when provided with AI assistance. The last
panel of table 2 shows the treatment effects on time spent per case and clicks. These outcomes
are measured at the case level. In the X-Ray Only treatment, radiologists spend about 2.6
minutes per case. Both AI and CH increase the time spend per case by a statistically
significant amount of approximately 4%. The interaction effct γAI×CH is not significant
for either of the two outcome variables. These effects suggest that decisions where both
radiologists and the AI are involved come at a non trivial increase in time spent per case.
This result further undercuts the potential benefits in performance from including humans
assisted with AI predictions “in the loop.”
4.3 Discussion
We find that AI assistance does not improve the performance of our participants on average,
even though the AI predictions are more accurate than the majority of radiologists, and
that radiologists respond to this assistance. However, the average treatment effects mask
important heterogeneity – AI assistance improves performance for a large set of cases, but
also decreases performance in many instances. These results point to biases in the use of AI
predictions which we will further investigate in section 5.
We also find that humans have access to valuable contextual information, suggesting that
full automation has its drawbacks. But the biases above – especially in conjunction with
our finding that radiologists take longer when given AI assistance – undercut this potential
information advantage in a setup that involves AI assistance. Thus, the problem of how best
to deploy AI assistance may be non-trivial, and the optimal solution may involve selective
automation and/or AI assistance. Section 6 analyzes this problem.
Appendix C.4 shows that the results are qualitatively robust to a variety of alternative
analyses. The results are similar when we split the analysis by experimental design and when
we only focus on the initial across-comparison (using designs 1 and 2), although the latter leads
to estimates that are imprecise. Alternative ground-truth measures – including a leave-one-out
ground truth based on the assessments of our experimental participants and a continuous
ground truth, which simply averages the assessments of the ground-truth labelers – also yield
similar conclusions. The qualitative patterns of the treatment effects are unchanged if we
calibrate each radiologist's assessments to the ground truth before conducting the analysis.
Finally, incentives for accuracy, which are cross-randomized in designs 1 and 3, do not have
significant effects either.
5 Automation/Own-Information Bias/Neglect
An upshot of the results in section 4 is that our participants deviate from the baseline of a
Bayesian with correct beliefs about the joint distribution of their own information and the
AI signal. In this section, we model and estimate systematic deviations from this benchmark
– which we will refer to as Bayesian for short25 – and determine the implications of these
deviations for utilizing human expertise and AI predictions.
25
The omission of the qualifier "with correct beliefs" slightly abuses terminology because a possible explanation of the deviations we have documented is that our participants are Bayesians but update their beliefs using an incorrect model for the joint distribution of $s^A$, $s^E$, and $\omega$. We will entertain this possibility below.
The framework in section 2 shows that a key question is whether the odds ratios
$$\frac{p_r(\omega_i = 1 \mid s_i^A, s_{ir}^E)}{p_r(\omega_i = 0 \mid s_i^A, s_{ir}^E)} \quad \text{and} \quad \frac{\pi_r(\omega_i = 1 \mid s_i^A, s_{ir}^E)}{\pi_r(\omega_i = 0 \mid s_i^A, s_{ir}^E)}$$
differ from each other. Recall that Bayes' rule implies that
$$\ln \frac{\pi_r(\omega_i = 1 \mid s_i^A, s_{ir}^E)}{\pi_r(\omega_i = 0 \mid s_i^A, s_{ir}^E)} = \ln \frac{\pi_r(s_i^A \mid \omega_i = 1, s_{ir}^E)}{\pi_r(s_i^A \mid \omega_i = 0, s_{ir}^E)} + \ln \frac{\pi_r(\omega_i = 1 \mid s_{ir}^E)}{\pi_r(\omega_i = 0 \mid s_{ir}^E)}. \qquad (4)$$
We now consider a set of models of belief updating to describe systematic deviations from
this benchmark. In our models, the human correctly interprets their own signal when AI
assistance is not available but errs when both $s_i^A$ and $s_{ir}^E$ are observed. As we will show below,
AI assistance nonetheless unambiguously improves performance for a subset of parameters
within this family.
The first class of biases that we consider arises when the two terms on the right-hand side of
equation (4) are incorrectly weighted. Following Grether (1980; 1992), we parametrize this
type of error using the following parsimonious functional form:
$$\log \frac{p_r(\omega_i = 1 \mid s_i^A, s_{ir}^E)}{p_r(\omega_i = 0 \mid s_i^A, s_{ir}^E)} = b_r \log \frac{\pi_r(s_i^A \mid \omega_i = 1, s_{ir}^E)}{\pi_r(s_i^A \mid \omega_i = 0, s_{ir}^E)} + d_r \log \frac{\pi_r(\omega_i = 1 \mid s_{ir}^E)}{\pi_r(\omega_i = 0 \mid s_{ir}^E)}, \qquad (5)$$
where $b_r, d_r \ge 0$. The Bayesian (with correct beliefs) is a special case with $b_r = d_r = 1$. While
this linear form is restrictive, it has been useful for documenting several empirical regularities
that show deviations from Bayesian updating, such as base-rate neglect and underinference (see
Benjamin, 2019, for a review).
We will say that the human exhibits automation bias if $b_r > d_r$, and automation neglect if
$b_r < d_r$. To motivate this nomenclature, observe that when $b_r > d_r$, the human
over-weights the AI signal relative to their own. In the specific case when $d_r = 1$, the agent
overshoots when updating the posterior odds relative to a Bayesian. Analogously, if $b_r < d_r$,
then the human under-weights the AI signal relative to their own. We say that the human
exhibits own-information neglect if $d_r < 1$ and own-information bias if $d_r > 1$, where the
cutoff value of one is based on the Bayesian benchmark. Own-information biases are similar
to base-rate biases, but apply to beliefs given the expert's signals instead of unconditional
population rates (see Griffin and Tversky, 1992; Kahneman and Tversky, 1973).
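To make the roles of $b_r$ and $d_r$ concrete, the following sketch simulates posterior formation under a Grether-style rule of the form in equation (5) for a stylized environment with a Gaussian AI signal. The signal structure, the parameter values, and the assumption that the expert's own belief is simply 0.25 are illustrative choices, not features of the paper's data or estimates; setting b = d = 1 recovers the Bayesian benchmark.

```python
import numpy as np
from scipy.stats import norm

def grether_posterior(own_belief, llr_ai, b, d):
    """Posterior P(omega=1 | s^A, s^E) implied by a Grether-style rule:
    log posterior odds = b * (AI log-likelihood ratio)
                       + d * (log-odds of the expert's own belief).
    b = d = 1 corresponds to Bayesian updating with correct beliefs."""
    log_odds_own = np.log(own_belief / (1 - own_belief))
    log_odds_post = b * llr_ai + d * log_odds_own
    return 1 / (1 + np.exp(-log_odds_post))

# Stylized example: omega ~ Bernoulli(0.25); s^A | omega ~ N(omega, 1),
# independent of the expert's information given omega (an illustrative assumption).
rng = np.random.default_rng(1)
omega = rng.binomial(1, 0.25, 100_000)
s_ai = omega + rng.normal(0.0, 1.0, size=omega.size)
llr_ai = norm.logpdf(s_ai, 1, 1) - norm.logpdf(s_ai, 0, 1)  # log LR of the AI signal

own_belief = np.full(omega.size, 0.25)  # expert belief assumed calibrated but uninformative

for b, d, label in [(1.0, 1.0, "Bayesian"), (0.5, 1.0, "automation neglect"),
                    (1.5, 1.0, "automation bias"), (1.0, 1.5, "own-information bias")]:
    post = grether_posterior(own_belief, llr_ai, b, d)
    print(f"{label:22s} mean |posterior - omega| = {np.abs(post - omega).mean():.3f}")
```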
A second class of deviations we consider will allow for models in which decision-makers do
not account for the dependence between their own signal and the AI signal conditional on the
state. We refer to this as signal dependence neglect and parametrize it by replacing the
conditional likelihood ratio in equation (5) with its unconditional counterpart:
$$\log \frac{p_r(\omega_i = 1 \mid s_i^A, s_{ir}^E)}{p_r(\omega_i = 0 \mid s_i^A, s_{ir}^E)} = b_r \log \frac{\pi_r(s_i^A \mid \omega_i = 1)}{\pi_r(s_i^A \mid \omega_i = 0)} + d_r \log \frac{\pi_r(\omega_i = 1 \mid s_{ir}^E)}{\pi_r(\omega_i = 0 \mid s_{ir}^E)}, \qquad (6)$$
where $b_r$ and $d_r$ are allowed to differ from 1 as above. In the case when the signals are jointly
multivariate normal and $b_r = d_r = 1$, signal dependence neglect yields correlation neglect as
defined in Enke and Zimmermann (2019).26 More generally, we will consider models that
vary the conditioning set and the dimension of $s_i^A$ in the first term on the right-hand side.
The specific examples are motivated and discussed further in section 5.3 below.
We intend the functional forms above as descriptions of humans' updating rules and will
remain agnostic about underlying mechanisms and micro-foundations. In particular, we
remain silent on whether our participants are Bayesians who update their beliefs using an
incorrect joint distribution of $\omega_i$, $s_i^A$, and $s_{ir}^E$, or whether they are non-Bayesians. The
former type of model, known as a quasi-Bayesian model,27 can generate automation bias
or neglect as well as correlation biases.28 An implicit assumption in our model, and likely in
other micro-foundations for the functional forms above as well, is that the signal acquired
by the human is invariant to the provision of AI assistance. Whether additional training
or experience with the AI can correct deviations from the benchmark model is therefore
something that we leave for future work.
Nonetheless, the models above will prove useful for our purposes. From a theoretical perspective,
they will help outline the types of deviations that potentially decrease decision
quality. From an empirical perspective, they help us understand the drivers of the treatment
effects documented earlier and turn out to be a good approximation to the data from the
experiment.
26
If $s_i^A$ and $s_{ir}^E$ are unidimensional with $(s_i^A, s_{ir}^E) \sim N(0, \Sigma_r)$, then the covariance matrix $\Sigma_r$ is a sufficient statistic for the posterior probability that $\omega_i = 1$ given the signals if $\omega_i = \mathbf{1}\{s_i^A + s_{ir}^E \ge \varepsilon_i\}$ and $\varepsilon_i$ is independent of $(s_i^A, s_{ir}^E)$.
27
See Rabin (2013) for a definition and Barberis et al. (1998); Rabin (2002); Rabin and Vayanos (2010) for examples.
28
To see this, assume that $p_r(s_i^A \mid \omega_i, s_{ir}^E) = \pi(s_i^A \mid \omega_i, s_{ir}^E)^{b}$ and $p_r(s_{ir}^E, \omega_i) = \pi_r(s_{ir}^E, \omega_i)$ to generate the functional form in equation (5) for any $b_r$ as long as $d_r = 1$. The derivation of equation (6) is similar. In contrast to automation bias/neglect and correlation biases, own-information bias/neglect cannot be derived in a quasi-Bayesian model because we assume that $p_r(\omega_i \mid s_{ir}^E) = \pi_r(\omega_i \mid s_{ir}^E)$.
Figure 4: Comparing decisions with and without AI assistance – Bayesian with correct beliefs
[Figure: decision regions in the plane spanned by $\log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)}$ (horizontal axis) and $\log \frac{\pi(s^A \mid \omega=1, s^E)}{\pi(s^A \mid \omega=0, s^E)}$ (vertical axis), showing the no-AI cutoff at $\log(c_{rel})$, the Bayesian cutoff with AI, and the regions where $a^*_{No\,AI}$ and $a^*_{Bayesian}$ agree and disagree.]
Note: The figure shows the decision criterion of a Bayesian with and without AI assistance and where their decisions agree and
disagree. Shaded regions show the regions in which AI improves or worsens decision making.
We now show that the types of deviations described above have implications for when AI
assistance unambiguously improves human performance. The results will also illustrate the
utility of the simple functional forms in equations (5) and (6). This subsection drops the $i$
and $r$ indices for simplicity of notation.
It is useful to start by considering the decisions with and without AI assistance for a Bayesian
decision maker. Figure 4 illustrates the realizations of $s^A$ for which the optimal decision with
AI assistance differs from the decision without AI assistance for a fixed $c_{rel}$. The horizontal
and vertical axes respectively represent $\log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)}$ and $\log \frac{\pi(s^A \mid \omega=1, s^E)}{\pi(s^A \mid \omega=0, s^E)}$. As shown by the
vertical dashed line, the decision-maker without AI would take action 1 if and only if $\log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)}$
exceeds $\log c_{rel}$. The solid line represents the analogous boundary for a Bayesian who has
access to AI assistance. Observe that the decisions a Bayesian makes as a function of the
signals $s^A$ and $s^E$ cannot be improved without additional information. Thus, a Bayesian with
correct beliefs and access to both signals improves upon the no-AI action in the vertically
shaded region.
Now consider humans who may deviate from this benchmark model. A human who takes
a given action without AI assistance, $a^*_{No\,AI} = a^*(s^E; p_r)$, but a different action with AI
assistance (so that $a^*_{No\,AI} \neq a^*_{AI}$) makes a worse decision if $a^*_{AI} = a^*(s^A, s^E; p_r)$ disagrees with
the Bayesian's decision $a^*_{Bayesian} = a^*(s^A, s^E; \pi_r)$ with AI assistance. This follows because, in
the binary-action setup, only one of the decisions can agree with the Bayesian decision. In
all other cases, the human's decision is weakly improved for the signal realization $(s^A, s^E)$.
In other words, a human whose decision changes upon receiving the AI signal $s^A$ is better
off with AI assistance only if the change agrees with the Bayesian decision. The human is
unambiguously better off if this property holds for all signals.
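The logic of this comparison can be stated compactly in code: among realizations where the action changes once the AI signal is shown, only changes that agree with the Bayesian action given both signals are improvements. A small sketch with hypothetical 0/1 action arrays:

```python
import numpy as np

def classify_decision_changes(a_no_ai, a_with_ai, a_bayes):
    """Among cases where the human's action changes once the AI signal is shown,
    a change is an improvement only if it agrees with the Bayesian action given
    both signals; otherwise it is a deterioration. Inputs are 0/1 arrays."""
    a_no_ai, a_with_ai, a_bayes = map(np.asarray, (a_no_ai, a_with_ai, a_bayes))
    changed = a_no_ai != a_with_ai
    improved = changed & (a_with_ai == a_bayes)
    worsened = changed & (a_with_ai != a_bayes)
    return improved.mean(), worsened.mean()
```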
Our first result states that a human who deviates from the benchmark model only by exhibiting
automation neglect is unambiguously better off with AI assistance.
[Figure 5: Decision boundaries, as a function of $\log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)}$ and the cutoff $\log(c_{rel})$, for a Bayesian, for a decision maker without AI, and for biased decision makers. Panel (a): automation bias ($b > 1$, $d = 1$) and automation neglect ($b < 1$, $d = 1$). Panel (b): automation neglect combined with own-information bias ($b < 1$, $d > 1$).]
Note: This figure shows where the decisions of an expert, as a function of the signals, disagree with the Bayesian in cases
with and without AI assistance. Panel (a) shows automation bias and neglect (absent own-information bias and neglect). Panel
(b) shows a decision maker who exhibits both automation neglect and own-information bias.
error. Hence, in the presence of own-information bias or neglect, AI assistance can result in
worse decisions for some signals.
We next consider a decision-maker with a different type of bias, namely, one who exhibits
signal dependence neglect. Perhaps not surprisingly, our next result shows that this type of
bias on its own can result in worse decisions with AI assistance:
Proposition 2. Suppose that the human exhibits signal dependence neglect, so that the posterior
belief is described by equation (6). For any value of $b > 0$, $d > 0$, and $c_{rel} > 0$, there exist
log-likelihood ratios $\log \frac{\pi(s^A \mid \omega=1, s^E)}{\pi(s^A \mid \omega=0, s^E)}$ and $\log \frac{\pi(s^A \mid \omega=1)}{\pi(s^A \mid \omega=0)}$ such that the human attains a
lower expected payoff $V(s)$ with AI assistance.
the signal structure. The general signal structure makes it difficult to characterize mistakes
in terms of over- or under-updating, unlike in the case of a multivariate normal model. Even
when $b = d = 1$, so that automation or own-information bias/neglect are not relevant, an
examination of equations (4) and (6) reveals that whether a decision-maker exhibits under- or
over-updating depends on the difference between $\log \frac{\pi(s^A \mid \omega=1, s^E)}{\pi(s^A \mid \omega=0, s^E)}$ and $\log \frac{\pi(s^A \mid \omega=1)}{\pi(s^A \mid \omega=0)}$.
The propositions above have important implications for the design of human-AI collaboration,
which we consider in section 6. Specifically, we study an AI designer who only has access to
the AI signal $s^A$ and must decide on one of three modes of delegation – utilize only the
AI prediction, delegate the case to the human, or provide AI assistance to a human expert.
The results show that, other than in the case when automation neglect is the only relevant
bias, the designer must learn the types of biases as well as the distribution of $\pi(s^A, s^E \mid \omega)$ to
determine which delegation modality yields the best decision.
We now turn to an empirical implementation of the model above. The analysis in this section
will be based on designs 2 and 3 because they allow us to observe the same participant make
decisions under all information conditions on a given case. Start by considering an empirical
analog to equation (5):
$$\log \frac{p_r(\omega_i = 1 \mid s_{ir}^A, s_{ir}^E)}{p_r(\omega_i = 0 \mid s_{ir}^A, s_{ir}^E)} = a + b \cdot \log \frac{\pi_r(s_{ir}^A \mid \omega_i = 1, s_{ir}^E)}{\pi_r(s_{ir}^A \mid \omega_i = 0, s_{ir}^E)} + d \cdot \log \frac{\pi_r(\omega_i = 1 \mid s_{ir}^E)}{\pi_r(\omega_i = 0 \mid s_{ir}^E)} + \varepsilon_{ir}, \qquad (7)$$
where we have omitted heterogeneity across radiologists in $b_r$ and $d_r$. Appendix C.5.6 discusses
radiologist heterogeneity in our estimates. Two of the terms in this equation are
directly elicited in the experiment: the probability in the second term on the right-hand side,
$\pi_r(\omega_i = 1 \mid s_{ir}^E)$, is equal to the radiologist's assessment in the treatment arm where the subjects
read cases without AI assistance, and the term $p_r(\omega_i = 1 \mid s_{ir}^A, s_{ir}^E)$ in the dependent variable is
the assessment in the treatment arm with AI. The "update term," given by $\log \frac{\pi_r(s_{ir}^A \mid \omega_i = 1, s_{ir}^E)}{\pi_r(s_{ir}^A \mid \omega_i = 0, s_{ir}^E)}$,
is not directly elicited; the first challenge is to construct it. By Bayes' rule, it can be written as follows:
$$\log \frac{\pi_r(s_{ir}^A \mid \omega_i = 1, s_{ir}^E)}{\pi_r(s_{ir}^A \mid \omega_i = 0, s_{ir}^E)} = \log \frac{\pi_r(\omega_i = 1 \mid s_{ir}^A, s_{ir}^E)}{\pi_r(\omega_i = 0 \mid s_{ir}^A, s_{ir}^E)} - \log \frac{\pi_r(\omega_i = 1 \mid s_{ir}^E)}{\pi_r(\omega_i = 0 \mid s_{ir}^E)}.$$
If $s_{ir}^E$ can be constructed or controlled for, then we can estimate the first term on the right-hand
side using data on $\omega_i$ and $s_{ir}^A$ via a binary response model. This estimation is possible
because the signal from the AI that the humans receive is isomorphic to the vector of predicted
probabilities for the various diseases observed. The second term in this equation has been
elicited.
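As an illustration of this construction, the sketch below estimates the first term with a plug-in binary response model (a logistic regression of the ground truth on the AI log-odds and proxy controls) and subtracts the elicited no-AI log-odds. Column names are hypothetical, probabilities are clipped away from 0 and 1 before taking log-odds, and the plug-in approach ignores the measurement-error concerns that motivate the paper's GMM strategy below.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def log_odds(p, eps=1e-4):
    """Log-odds, clipping elicited probabilities of exactly 0 or 1."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

def build_update_term(df, proxy_cols):
    """Construct the update term of equation (7).

    Hypothetical columns:
      truth       : ground-truth label omega_i in {0, 1}
      ai_prob     : AI predicted probability for the focal pathology (s^A)
      prob_no_ai  : elicited P(omega=1 | s^E) from the no-AI arm
      proxy_cols  : elicited no-AI probabilities used as proxy controls for s^E
    """
    # Binary response model for P(omega = 1 | s^A, s^E-proxies).
    X = np.column_stack([log_odds(df["ai_prob"])] +
                        [log_odds(df[c]) for c in proxy_cols])
    fit = LogisticRegression(C=1e6).fit(X, df["truth"])  # near-unpenalized logit
    fitted_log_odds = fit.decision_function(X)           # log-odds of P(omega=1 | s^A, proxies)

    # Update term = estimated log-odds given (s^A, s^E) minus elicited no-AI log-odds.
    return fitted_log_odds - log_odds(df["prob_no_ai"])
```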
This brings us to the second challenge – constructing $s_{ir}^E$ when estimating $\pi_r(\omega_i = 1 \mid s_{ir}^A, s_{ir}^E)$,
because we do not observe it directly. Let $c_i$ be the patient case associated with case-pathology
$i$ and $I(c_i)$ be the set of case-pathologies associated with case $c_i$. If $s_{ir}^E$ is unidimensional
and $\pi_r(\omega_i \mid s_{ir}^E)$ is monotonic in $s_{ir}^E$, then $\pi_r(\omega_i \mid s_{ir}^E)$ is a valid control variable. However,
we want to allow for the possibility that the radiologist evaluates a case holistically and
uses signals across pathologies. Under the assumption that $s_i^A \perp s_{ir}^E \mid \omega_i, \{\pi_r(\omega_{i'} \mid s_{i'r}^E)\}_{i' \in I(c_i)}$,
the vector of probabilities for all pathologies reported by $r$ for case $i$, denoted $\{\pi_r(\omega_{i'} \mid s_{i'r}^E)\}_{i' \in I(c_i)}$,
is a valid control variable. In this multi-dimensional case, a sufficient condition is that $s_{ir}^E$ is
invertible in $\{\pi_r(\omega_{i'} \mid s_{i'r}^E)\}_{i' \in I(c_i)}$. Our empirical specifications will therefore employ multivariate
proxy controls for $s_{ir}^E$ using elicited probability assessments for multiple pathologies.
A final challenge is that radiologists may exhibit signal dependence neglect when updating beliefs. As prefaced earlier, although the humans' and
the AI's signals are not conditionally independent given the ground truth, humans may act as if they
are. We will therefore estimate and select between models that vary the set of signals conditioned
on in the update term. For example, in the case when radiologists behave as if $s_i^A$ and
$s_{ir}^E$ are independent conditional on $\omega_i$, the update term drops the conditioning on $s_{ir}^E$. We can
also vary the pathologies across the set of models considered when constructing $I(c_i)$.29 Including
these models in our selection approach will also tell us whether radiologists' assessments are
separable across different pathologies.
The correct model of behavior satisfies the conditional moment restriction $E[\varepsilon_{ir} \mid s_{i,-r}^E, s_i^A] = 0$,
where $s_{i,-r}^E$ collects the signals of the radiologists other than $r$ in our experiment. For
estimation, we utilize unconditional moment restrictions based on functions of $s_{i,-r}^E$ and
$s_i^A$ that closely mimic the terms in equation (7). Our instruments include $\log \frac{\pi(\omega_i = 1 \mid s_i^A)}{\pi(\omega_i = 0 \mid s_i^A)}$
and leave-one-out averages of $\log \frac{\pi(\omega_i = 1 \mid s_i^A, s_{ir'}^E)}{\pi(\omega_i = 0 \mid s_i^A, s_{ir'}^E)}$ for radiologists other than $r$ that use various
proxies for $s_{ir'}^E$ based on different sets of pathologies $I(c_i)$ that are conditioned on.30 Empirical
analogs of the resulting moment conditions are used to estimate the model using GMM.
We will employ the model-selection procedure proposed in Andrews and Lu (2001) to select
between non-nested models. This method utilizes the J-statistic of the GMM objective
function, which is adjusted for the number of moments and parameters that are included
in the model. The selection criterion, MMSC-BIC, penalizes models that reject a greater
number of moment restrictions.
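For concreteness, a bare-bones two-step GMM sketch is shown below, assuming the dependent variable `y`, the regressor matrix `X` (constant, update term, own-information term), and the instrument matrix `Z` have already been built as arrays along the lines described above. The MMSC-BIC line follows Andrews and Lu (2001) only in spirit; the exact moments, weighting, and penalties used in the paper are described in the text and appendix.

```python
import numpy as np

def gmm_linear(y, X, Z):
    """Two-step GMM for y = X @ theta + eps with moments E[Z' eps] = 0.
    X columns: [1, update term, own-information term] -> theta = (a, b, d).
    Returns (theta, J_stat, n_moments, n_params)."""
    n, k = X.shape

    def theta_hat(W):
        A = X.T @ Z @ W @ Z.T @ X
        return np.linalg.solve(A, X.T @ Z @ W @ Z.T @ y)

    # Step 1: identity weighting matrix.
    theta1 = theta_hat(np.eye(Z.shape[1]))
    u = y - X @ theta1
    # Step 2: efficient weighting matrix built from step-1 residuals.
    S = (Z * u[:, None]).T @ (Z * u[:, None]) / n
    W = np.linalg.inv(S)
    theta2 = theta_hat(W)

    g = Z.T @ (y - X @ theta2) / n     # sample moments
    J = n * g @ W @ g                  # Hansen J-statistic
    return theta2, J, Z.shape[1], k

def mmsc_bic(J, n_moments, n_params, n):
    """Andrews-Lu style criterion: the J-statistic minus a BIC-type bonus for
    overidentifying restrictions; smaller is better across non-nested models."""
    return J - (n_moments - n_params) * np.log(n)
```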
5.4 Results
Table 3 presents estimates from six models, a subset of the models we consider.31
The first three models do not utilize clinical history (when it is made available to the participants)
to construct the proxy for $s_{ir}^E$, whereas the last three models do. We consider three models
in each of these two sets by varying $I(c_i)$. The first corresponds to the case when the signal
$s_i^A$ only includes the focal pathology and is conditionally independent of $s_{ir}^E$ given $\omega_i$. The
second allows for dependence between $s_i^A$ and $s_{ir}^E$ given $\omega_i$, but only through $\pi_r(\omega_i \mid s_{ir}^E)$ for
the focal pathology and potentially clinical history. The third utilizes the correct update
term constructed using all available AI predictions and the vector of probabilities $\pi_r(\omega_i \mid s_{ir}^E)$
for all pathologies. Setting $b = d = 1$ and the constant to 0 in this last case corresponds to
Bayesian updating with correct beliefs.
29
The conditionally independent case corresponds to the extreme case in which $I(c_i) = \emptyset$, whereas the Bayesian model includes all pathologies.
30
Specifically, we construct fourteen instruments. The first is a constant and the second is the average of $\log \frac{\pi(\omega_i = 1 \mid s_{ir'}^E)}{\pi(\omega_i = 0 \mid s_{ir'}^E)}$ for all $r' \neq r$. The remaining twelve construct the average of $\log \frac{\pi(\omega_i = 1 \mid s_{ir'})}{\pi(\omega_i = 0 \mid s_{ir'})}$ for all $r' \neq r$ by varying the conditioning variables $s_{ir'}$. The different sets of conditioning variables in $s_{ir'}$ are presented in the second panel of appendix table C.21. These sets are used because they are the relevant terms in at least one of the models that we consider in the testing procedure.
31
See table C.20 in the appendix for the results from all models. Results from all pathologies with AI are qualitatively similar (see table C.21). These analyses were pre-registered, except for the model selection exercise, which was not included in the pre-analysis plan.
The results from this exercise point to two types of errors in radiologists' use of AI signals. The
first type of error is that radiologists neglect signal dependence even though AI predictions
and radiologists' signals are highly correlated after conditioning on the ground truth (see
appendix table C.18). This conclusion follows because we select the model in column 1,
which uses an update term based on equation (6) and therefore does not control for $s^E$.
Another implication of the selected model is that radiologists do not incorporate information
across different pathologies, since only the focal pathology is relevant. This result validates
our previous analysis, which analyzes each pathology separately.
Second, across different models, including all models that are not selected, we estimate that
radiologists exhibit automation neglect. However, radiologists do not exhibit substantial own-information
bias or neglect, since the estimated value of $d$ is close to 1 in all specifications.
Accounting for measurement error in estimating this model is important, as versions that do
not do so estimate $d$ significantly less than 1 (appendix C.5.3).
Connecting these observations back to our theoretical discussion in section 2, the parameters
are such that radiologists will not benefit unambiguously from access to the AI signal,
primarily because of signal dependence neglect.
These deviations also explain the heterogeneous conditional average treatment effects documented
in section 4. Figure 6 shows the treatment effect of AI on our radiologists' deviation from the ground
truth, along with the same treatment effects computed under three different models: the Bayesian
benchmark, the model in equation (5) where radiologists use the correct updating term, and the
selected model from table 3. As expected, the Bayesian would benefit significantly from gaining
access to the AI signal. In fact, as indicated by the large reductions in the deviation from the
ground truth, there is significant potential value in combining the human and the AI signal.
Imposing the model in equation (5) with the correct updating term – column (6) – reduces these
improvements and moves the implied treatment effects closer to the data. However, we see that
throughout the entire signal range of $s^A$, such a decision-maker would still unambiguously benefit
from gaining access to the AI. Only when we impose the selected update term, under which
radiologists do not account for the dependence of signals, does the model generate the observed
negative treatment effect.
[Figure 6: ATE of AI on deviation from ground truth (vertical axis), by AI prediction bin ((0.0, 0.2] through (0.8, 1.0]), for the observed Human + AI data and for the model-implied counterfactuals (Bayes and the estimated models).]
Note: This graph shows the observed conditional treatment effects of providing radiologists access to AI (Human + AI) and
compares those to three different model-implied treatment effects: giving AI access to a Bayesian decision maker, and giving AI
access to a decision maker who acts according to the empirical version of equation (5), both under the correct updating term and
when the decision maker treats the AI signal as conditionally independent.
Thus, consistent with the conclusions above, we find evidence for automation neglect, but no
meaningful own-information bias or neglect. These results suggest that the estimated model
represents a good approximation to the experimental data.
Taken together, the results indicate that while there are large potential gains from combining
radiologists' assessments with AI predictions, biases in radiologists' use of AI assistance
undercut these gains. We find that radiologists exhibit both automation neglect and signal
dependence neglect. These mistakes result in no improvement in diagnostic performance when
radiologists are provided with AI assistance. One approach to addressing this issue is to provide
radiologists with better training, an avenue we leave for future research. A second approach is
to design better collaborative systems that are built on predicting the types of cases in which
human experts outperform AI predictions and vice versa, and the types of cases in which
AI-assisted humans do best, an issue we turn to next.
We now consider the design of collaborative systems between AI and radiologists. The designs
we consider restrict attention to systems that use the AI signal to delegate a case to one of
three modalities – radiologists alone, the AI alone, or the radiologist with access to the AI.
As a warm-up for this exercise, it is useful to examine the predictive performance of the
different modalities conditional on $s^A$ (see figure 7). Recall from the conditional treatment
effect analysis that radiologists' assessments improve with AI in the lowest and the two highest
bins of AI signals. Figure 7 also shows that in two of the three ranges in which the AI improves
radiologists' decision-making, its own performance is even better than the performance
of the radiologists with AI.
However, this figure misses the differences in human time costs across modalities. Our
analysis of the estimated treatment effects shows that radiologists take more time when
provided with AI predictions. Moreover, using AI predictions alone saves on costly human
effort. This points to the conclusion that in most cases where AI improves decision-making,
one is at least as well off relying exclusively on the AI prediction. But using only AI, or never
assisting humans with AI predictions, may come at the cost of performance in cases when
using the human alone is superior.
Motivated by these observations, we now examine whether there is a trade-off between the marginal
costs of human effort and diagnostic performance.
[Figure 7: Deviation from ground truth (vertical axis) by AI prediction bin ((0.0, 0.2] through (0.8, 1.0]) for each modality: Human Only (Data), Human + AI (Data), AI, and Estimated Model (True Joint Distribution).]
Note: This figure shows the performance of the different modalities that we consider for the optimal collaborative system. Cases
are either decided by only the radiologist, only the AI, or the radiologist with access to the AI. The performance measures for
Human Only and Human + AI are constructed from our treatment effect analysis.
6.1 Computing the Trade-off Between Decision Loss and Costs of Human Effort
Consider a policy $\tau(\cdot)$ that chooses between full automation (AI), radiologist with access to
AI (E+AI), or radiologist without access to AI (E), as a function of the AI signal $s_i^A$. The
optimal policy, which minimizes the sum of the expected decision loss (costs of false positives
and false negatives) and the monetized time cost of using human radiologists, solves:
$$\tau^*(s_i^A) = \arg\min_{\tau \in \{E,\, E+AI,\, AI\}} \; -m \cdot V_i^\tau(s_i^A) + w \cdot C_i^\tau(s_i^A). \qquad (8)$$
The first term contains the expected decision loss from a modality, given by $V_i^\tau(s_i^A) = E[V_{ir}^\tau(s_i^A) \mid s_i^A]$,
which is the expected diagnostic quality averaging over radiologists and cases given the modality,
preferences for false positives and false negatives, and the AI signal. Preferences for false
positives and false negatives are allowed to vary by pathology but are the same across modalities.
We estimate these preferences using data on the binary treatment recommendations
of the radiologists in our experiment, given their probability assessments. Details of this
preference estimation step are available in appendix C.6. The average cost of a false negative
across pathologies is three times the cost of a false positive. Since we do not know the dollar
cost of a false negative ($m$), we will present results for varying values of $m$.
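To convey the idea behind this preference-estimation step, the sketch below recovers the probability cutoff at which a radiologist switches to recommending treatment from the elicited assessments and binary recommendations; under a threshold rule, that cutoff pins down the decision cutoff on the posterior odds ($c_{rel}$ in the text), interpretable as the relative cost of a false positive to a false negative. It is a stylized stand-in for, not a replication of, the procedure in appendix C.6.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def implied_threshold(prob_assessment, recommend, eps=1e-4):
    """Estimate the probability cutoff p* at which treatment is recommended.
    Under a threshold rule 'treat iff p >= p*', the odds p*/(1 - p*) equal the
    decision cutoff on the posterior odds (the relative cost of a false positive
    to a false negative). Inputs (hypothetical): elicited probabilities in [0, 1]
    and 0/1 treatment recommendations."""
    p = np.clip(np.asarray(prob_assessment, dtype=float), eps, 1 - eps)
    x = np.log(p / (1 - p)).reshape(-1, 1)
    fit = LogisticRegression(C=1e6).fit(x, recommend)    # near-unpenalized logit
    cutoff_log_odds = -fit.intercept_[0] / fit.coef_[0, 0]  # where P(recommend) = 0.5
    p_star = 1 / (1 + np.exp(-cutoff_log_odds))
    return p_star, p_star / (1 - p_star)                 # (threshold, implied odds cutoff)
```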
The second term in the objective function contains $C_i^\tau(s_i^A) = E[C_{ir}^\tau(s_i^A) \mid s_i^A]$, which is the
expected radiologist time for a given modality. If the case is fully automated, this time
cost is zero; otherwise, the time costs are based on our experimental estimates, which
show that radiologists spend more time on cases when presented with AI predictions. To
interpret the magnitudes, we translate both the radiologists' expected time cost and
the expected decision loss into dollars. For the cost of radiologist time, we set w = $4 per
minute based on a payment of $10 per case and an average time per read of 2.5 minutes.
We solve this problem by training a classification tree that assigns the case to a modality
conditional on the AI signal $s_i^A$. We use 55% of cases as training data for the decision tree,
20% of cases for validation, and 25% as the test set. For this exercise, we focus on two top-level
pathologies: cardiomediastinal abnormality and airspace opacity. Finally, to quantify
the drawbacks of biased humans, we will contrast the solution from such a system when the
decisions under E+AI are as observed in the data with the system that uses the Bayesian
assessments for E+AI.
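A schematic version of this delegation step is sketched below, assuming per-case estimates of expected diagnostic quality and expected radiologist minutes under each modality are already available. Column names, the tree depth, and the simple 75/25 split are placeholders; the paper's implementation uses a 55/20/25 train/validation/test split and pathology-specific preferences.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

MODALITIES = ["E", "E+AI", "AI"]

def delegation_tree(df, m, w, max_depth=3, seed=0):
    """Train a tree that maps the AI signal s^A to a modality, mirroring the
    objective in equation (8): minimize -m * V + w * C per case.
    Hypothetical columns: 'ai_prob', and for each modality tau,
    'V_<tau>' (expected diagnostic quality) and 'C_<tau>' (expected minutes)."""
    # Per-case cost of each modality and the cost-minimizing label.
    costs = pd.DataFrame({tau: -m * df[f"V_{tau}"] + w * df[f"C_{tau}"]
                          for tau in MODALITIES})
    labels = costs.idxmin(axis=1)

    X = df[["ai_prob"]].to_numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                              random_state=seed)
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
    tree.fit(X_tr, y_tr)
    # Share of test cases assigned to each modality by the fitted policy.
    assigned = pd.Series(tree.predict(X_te)).value_counts(normalize=True)
    return tree, assigned
```

Re-running the function over a grid of values for m traces out the frontier and the modality shares discussed in the results below.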
An important restriction in this formulation of the delegation problem is that the signal
received by the expert is exogenous and does not respond to the availability of AI. This
assumption rules out models of endogenous information acquisition, for example, rational
inattention as in Sims (2003). A potentially interesting question is whether a designer can
leverage such endogenous responses by designing an information revelation policy that induces
effort. We leave such extensions, which leverage insights from information design (Kamenica
and Gentzkow, 2011; Bergemann and Morris, 2019), to future work, but we do measure the total
amount of time taken by the experts in our experiment with and without AI assistance.
6.2 Results
Figure 8 shows a possibilities frontier for the trade-off between diagnostic quality and
decision time for the two top-level pathologies with AI predictions, calculated by varying $m$. It also
shows the corner solutions where all cases are assigned to each of the modalities. The results
suggest that there are potential gains from the optimal delegation of cases. An unassisted
radiologist takes 2.7 minutes per case, or about $11, and incurs a relative decision loss of
approximately seven for both of the top-level pathologies with AI predictions. By moving
to a solution on the frontier with a human expert – one whose assessments are as in our
experiment – the designer can reduce both the decision loss and the time taken per case by
delegating many cases to the AI. The frontier with a Bayesian expert, as expected, is much
more favorable. Moreover, the frontiers for a human and a Bayesian expert coincide
when the dollar cost of incorrect decisions ($m$) is small, because the vast majority of cases are
assigned to the AI.
[Figure 8: Decision loss ($c_{FN}$ per 100 cases, vertical axis) against mean time spent on a case (minutes, horizontal axis), for each of the two pathologies, showing the frontiers traced out by varying $m$ and the corner solutions such as AI Only.]
Next, we investigate the proportion of decisions assigned to the three modalities as we vary
$m$ (figures 9 and 10). Both for the case where radiologists are Bayesian and for the observed
behavior in our experiment, we find that the AI decides almost all cases if the cost of a false
negative – a missed diagnosis – is less than $100 per case. If radiologists acted as Bayesians
with correct beliefs, the share of cases that involve human-AI collaboration would rise markedly
above a cost of $100, but even for costs as high as $10,000, this share does not exceed 50% for
airspace opacity and does not exceed 40% for cardiomediastinal abnormality. Moreover, under
Bayesian decision-making, the share of cases where only the radiologist decides is negligible
for both pathologies, and the only reason for using an unassisted radiologist is to save on time
costs. When we conduct the same exercise using the observed behavior of radiologists, we
find that humans are involved in approximately 50% of cases for airspace opacity if the cost
of a false negative is sufficiently large. For cardiomediastinal abnormality, the share of cases
with humans involved only reaches 10% for very large costs of false negatives. Moreover, the
majority of cases where a radiologist is involved have the radiologist make decisions without
AI for both pathologies. A more complete assessment of the optimal combination of human
and machine decisions therefore confirms the intuition from above that cases are either
decided by the radiologist or the AI but not by both of them together.
[Figure 9: Share of cases decided by each modality (vertical axis, 0 to 1) against the social cost of a false negative (dollars, log scale from $10^1$ to $10^4$), for airspace opacity. Panel (a): Bayesian decision maker; Panel (b): human decision maker as observed in the experiment. Legend: AI, Human, Human + AI.]
Note: The graphs show the share of cases decided by each modality (humans, AI, humans + AI) conditional on the cost of a
false negative in dollars, denoted $m$ in the text, for airspace opacity. Panel (a) focuses on a Bayesian decision maker. Panel (b)
focuses on a human decision-maker with decisions and time taken as in our experiment.
[Figure 10: Share of cases decided by each modality (vertical axis, 0 to 1) against the social cost of a false negative (dollars, log scale from $10^1$ to $10^4$), for cardiomediastinal abnormality. Panel (a): Bayesian decision maker; Panel (b): human decision maker as observed in the experiment. Legend: AI, Human, Human + AI.]
6.3 Caveats
There are several caveats to the analysis here. The first is that a collaborative system may
change radiologists' expectations about the difficulty of cases, and radiologists may adjust
strategically to those changing expectations. For example, if radiologists are only handed the
most difficult cases, they may exert more effort. We leave such strategic adjustments to a
collaborative system for future work. We also have not considered alternative mechanisms
for eliciting radiologists' signals that may be less prone to errors. For instance, one may want to
ask the radiologist to submit their assessment before seeing the AI's assessment and then
combine both assessments optimally ex post. Finally, the solutions presented here treat
pathologies as separable. This case is relevant if there is a single pathology of interest.
7 Conclusion
AI is predicted to profoundly reshape the nature of work (see Felten et al., 2023). Humans
are likely to use AI as a decision aid for many tasks, not only in the long run but also in
the medium run for tasks that will ultimately be fully automated. A central question is
therefore how humans use AI tools and how tasks should be assigned. To understand the
benefits and pitfalls of human-machine collaboration, we conduct an experiment in which
radiologists are provided with AI assistance. Besides serving as an iconic example of a
highly skilled profession that is being transformed by AI, radiology also represents a large
class of professionals whose main job is a high-stakes classification task. Since we can simulate
radiologists' normal workflow, this is an ideal setting for conducting such an experiment.
While the potential benefits of deploying AI assistance are large in our setting, biases in
humans' use of AI assistance eliminate these potential gains. Even though the AI tool in our
experiment performs better than two-thirds of radiologists, optimally combining their predictions
could substantially improve performance. Yet, we find that giving radiologists access
to AI predictions does not, on average, lead to higher performance. This average treatment
effect, however, masks systematic heterogeneity: providing AI does improve radiologists' predictions
and decisions for cases where the AI is certain, but not when it is uncertain. This latter
result – that prediction quality can be reduced for some range of AI signals – rejects Bayesian
updating with correct beliefs. We also document systematic errors in belief updating – namely,
radiologists exhibit automation neglect (i.e., they underweight the AI prediction relative
to their own information) and treat the AI prediction and their own signals as independent even though
they are not. Moreover, radiologists take significantly more time to make a decision when
AI information is provided.
Together, these results have important implications for how to design the collaboration between
humans and machines. Increased time costs and sub-optimal use of AI information
both work against having radiologists make decisions with AI assistance. In fact, an optimal
delegation policy that utilizes heterogeneity in treatment effects given the AI prediction suggests
that cases should either be decided by the AI alone or by the radiologist alone. Only a
small share of cases are optimally delegated to radiologists with access to AI. In other words,
we find that radiologists should work next to, as opposed to with, AI. To the extent that expert
decision makers generally under-respond to information other than their own (Conlon et al.,
2022) and incorporating additional information is cognitively costly, these insights may hold
in other settings where experts' main job is a classification task.
There are several important considerations that are outside the scope of this work. One
question, motivated by the unrealized potential gains from AI assistance that we document, is whether
the benefits from AI-specific training for radiologists and/or experience with AI are large.
Answering these questions requires different experimental designs or longer-run studies. Another
open question is whether the heterogeneity in the value of AI assistance is correlated
with a human's baseline skill or other characteristics, and whether such correlation can be
predicted to target assistance. A consideration in the organization of human-AI collaboration
that we abstract away from is that the form of collaboration may influence experts' incentives,
to which humans may respond strategically. Finally, the use of AI in practice will be
References
Jason Abaluck, Leila Agha, Chris Kabrhel, Ali Raja, and Arjun Venkatesh. The determinants
of productivity in medical testing: Intensity and allocation of care. Am. Econ. Rev., 106
(12):3730–3764, December 2016.
Daron Acemoglu and Simon Johnson. Power and progress: Our Thousand-Year struggle over
technology and prosperity. Public Affairs, New York, 2023.
Ajay Agrawal, Joshua Gans, and Avi Goldfarb. Prediction Machines: The Simple Economics
of Artificial Intelligence. Harvard Business Press, April 2018.
Ajay Agrawal, Joshua S Gans, and Avi Goldfarb. Artificial intelligence: The ambiguous labor
market impact of automating prediction. J. Econ. Perspect., 33(2):31–50, May 2019.
Jong Seok Ahn, Shadi Ebrahimian, Shaunagh McDermott, Sanghyup Lee, Laura Naccarato,
John F Di Capua, Markus Y Wu, Eric W Zhang, Victorine Muse, Benjamin Miller, Farid
Sabzalipour, Bernardo C Bizzo, Keith J Dreyer, Parisa Kaviani, Subba R Digumarthy,
and Mannudeep K Kalra. Association of artificial Intelligence–Aided chest radiograph
interpretation with reader performance and efficiency. JAMA Netw Open, 5(8):e2229289–
e2229289, August 2022.
Eugenio Alberdi, Lorenzo Strigini, Andrey A Povyakalo, and Peter Ayton. Why are people’s
decisions sometimes worse with computer support? In Lecture Notes in Computer Sci-
ence, Lecture notes in computer science, pages 18–31. Springer Berlin Heidelberg, Berlin,
Heidelberg, 2009.
Donald W K Andrews and Biao Lu. Consistent model and moment selection procedures for
GMM estimation with application to dynamic panel data models. J. Econom., 101(1):
123–164, March 2001.
Victoria Angelova, Will Dobbie, and Crystal Yang. Algorithmic recommendations and human
discretion, 2022.
Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, and Daniel S Weld. Is the most
accurate AI the best teammate? optimizing AI for teamwork. AAAI, 35(13):11405–11414,
May 2021.
Nicholas Barberis, Andrei Shleifer, and Robert Vishny. A model of investor sentiment. J.
financ. econ., 49(3):307–343, September 1998.
Dan Benjamin, Aaron Bodoh-Creed, and Matthew Rabin. Base-Rate neglect: Foundations
and implications, 2019.
Dirk Bergemann and Stephen Morris. Information design: A unified perspective. J. Econ.
Lit., 57(1):44–95, March 2019.
Erik Brynjolfsson and Tom Mitchell. What can machine learning do? Workforce implications.
Science, 358(6370):1530–, 2017.
Erik Brynjolfsson, Daniel Rock, and Chad Syverson. Artificial intelligence and the modern
productivity paradox: A clash of expectations and statistics. November 2017.
Kate Bundorf, Maria Polyakova, and Ming Tai-Seale. How do humans interact with algo-
rithms? experimental evidence from health insurance. June 2020.
David C Chan, Matthew Gentzkow, and Chuan Yu. Selection with variation in diagnostic
skill: Evidence from radiologists. Q. J. Econ., 137(2):729–783, May 2022.
Gary Charness, Uri Gneezy, and Vlastimil Rasocha. Experimental methods: Eliciting beliefs.
J. Econ. Behav. Organ., 189:234–256, September 2021.
Daniel L Chen, Martin Schonger, and Chris Wickens. oTree—An open-source platform for
laboratory, online, and field experiments. Journal of Behavioral and Experimental Finance,
9:88–97, March 2016.
Emily F Conant, Alicia Y Toledano, Senthil Periaswamy, Sergei V Fotin, Jonathan Go,
Justin E Boatsman, and Jeffrey W Hoffmeister. Improving accuracy and efficiency with
concurrent use of artificial intelligence for digital breast tomosynthesis. Radiol Artif Intell,
1(4):e180096, July 2019.
John J Conlon, Malavika Mani, Gautam Rao, Matthew W Ridley, and Frank Schilbach. Not
learning from others. August 2022.
Janet Currie and W Bentley MacLeod. Diagnosing expertise: Human capital, decision mak-
ing, and performance among physicians. J. Labor Econ., 35(1), 2017.
David Danz, Lise Vesterlund, and Alistair J Wilson. Belief elicitation: Limiting truth telling
with information on incentives. June 2020.
Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: people
erroneously avoid algorithms after seeing them err. J. Exp. Psychol. Gen., 144(1):114–126,
February 2015.
Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Overcoming algorithm aversion:
People will use imperfect algorithms if they can (even slightly) modify them. Manage. Sci.,
64(3):1155–1170, March 2018.
Benjamin Enke and Florian Zimmermann. Correlation neglect in belief formation. Rev. Econ. Stud., 2019.
Ed Felten, Manav Raj, and Robert Seamans. How will language modelers like ChatGPT
affect occupations and industries? March 2023.
Edward W Felten, Manav Raj, and Robert Seamans. The occupational impact of artificial
intelligence: Labor, skills, and polarization. September 2019.
Riccardo Fogliato, Shreya Chappidi, Matthew Lungren, Paul Fisher, Diane Wilson, Michael
Fitzke, Mark Parkinson, Eric Horvitz, Kori Inkpen, and Besmira Nushi. Who goes first?
influences of Human-AI workflow on decision making in clinical imaging. In Proceedings
of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22,
pages 1362–1374. Association for Computing Machinery, June 2022.
Martin Ford. Rise of the Robots: Technology and the Threat of a Jobless Future. Basic Books,
May 2015.
Morgan R Frank, David Autor, James E Bessen, Erik Brynjolfsson, Manuel Cebrian, David J
Deming, Maryann Feldman, Matthew Groh, José Lobo, Esteban Moro, Dashun Wang,
Hyejin Youn, and Iyad Rahwan. Toward understanding the impact of artificial intelligence
on labor. Proc. Natl. Acad. Sci. U. S. A., 116(14):6531–6539, April 2019.
Susanne Gaube, Harini Suresh, Martina Raue, Eva Lermer, Timo Koch, Matthias Hudecek,
Alun D Ackery, Samir C Grover, Joseph F Coughlin, Dieter Frey, Felipe Kitamura, Marzyeh
Ghassemi, and Errol Colak. Who should do as AI say? only non-task expert physicians
benefit from correct explainable AI advice. June 2022.
Susanne Gaube, Harini Suresh, Martina Raue, Eva Lermer, Timo K Koch, Matthias F C
Hudecek, Alun D Ackery, Samir C Grover, Joseph F Coughlin, Dieter Frey, Felipe C
Kitamura, Marzyeh Ghassemi, and Errol Colak. Non-task expert physicians benefit from
correct explainable AI advice when reviewing x-rays. Sci. Rep., 13(1):1383, January 2023.
Avi Goldfarb, Bledi Taska, and Florenta Teodoridis. Could machine learning be a general
purpose technology? a comparison of emerging technologies using data from online job
postings. Res. Policy, 52(1):104653, January 2023.
David M Grether. Testing Bayes rule and the representativeness heuristic: Some experimental
evidence. J. Econ. Behav. Organ., 17(1):31–57, January 1992.
Dale Griffin and Amos Tversky. The weighing of evidence and the determinants of confidence.
Cogn. Psychol., 24(3):411–435, July 1992.
Marie-Pascale Grimon and Christopher Mills. The impact of algorithmic tools on child
protection: Evidence from a randomized controlled trial. 2022.
Jonathan Gruber, Benjamin R Handel, Samuel H Kina, and Jonathan T Kolstad. Managing
intelligence: Skilled experts and decision support in markets for complex products. 2021.
H Benjamin Harvey and Vrushab Gowda. How the FDA regulates AI. Acad. Radiol., 27(1):
58–61, January 2020.
Roxanne Heston and Remco Zwetsloot. Mapping u.s. multinationals’ global AI R&D activity,
2020.
Ahmed Hosny, Chintan Parmar, John Quackenbush, Lawrence H Schwartz, and Hugo J W L
Aerts. Artificial intelligence in radiology. Nat. Rev. Cancer, 18(8):500–510, August 2018.
Tanjim Hossain and Ryo Okui. The binarized scoring rule. Rev. Econ. Stud., 80(3):984–1001,
February 2013.
Nicholas C Hunt and Andrea M Scheetz. Using MTurk to distribute a survey or experiment:
Methodological considerations. Journal of Information Systems, 33(1):43–65, March 2019.
Kosuke Imai, Zhichao Jiang, James Greiner, Ryan Halen, and Sooahn Shin. Experimental
evaluation of Algorithm-Assisted human Decision-Making: Application to pretrial public
safety assessment. December 2020.
Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute,
Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins,
David A Mong, Safwan S Halabi, Jesse K Sandberg, Ricky Jones, David B Larson, Curtis P
Langlotz, Bhavik N Patel, Matthew P Lungren, and Andrew Y Ng. CheXpert: A large
chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of
the AAAI Conference on Artificial Intelligence, volume 33, pages 590–597, July 2019.
Alistair E W Johnson, Tom J Pollard, Lu Shen, Li-Wei H Lehman, Mengling Feng, Moham-
mad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark.
MIMIC-III, a freely accessible critical care database. Sci Data, 3:160035, May 2016.
Daniel Kahneman and Amos Tversky. On the psychology of prediction. Psychol. Rev., 80
(4):237–251, July 1973.
Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. Am. Econ. Rev., 101(6):
2590–2615, October 2011.
Hyo Eun Kim, Hak Hee Kim, Boo Kyung Han, Ki Hwan Kim, Kyunghwa Han, Hyeonseob
Nam, Eun Hye Lee, and Eun Kyung Kim. Changes in cancer detection and False-Positive
recall in mammography using artificial intelligence: a retrospective, multireader study. The
Lancet Digital Health, 2(3):e138–e148, March 2020.
Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. Prediction policy
problems. Am. Econ. Rev., 105(5):491–495, May 2015.
Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mul-
lainathan. Human decisions and machine predictions. Q. J. Econ., 133(1):237–293, August
2017.
Barnett S Kramer, Christine D Berg, Denise R Aberle, and Philip C Prorok. Lung cancer
screening with low-dose helical CT: results from the national lung screening trial (NLST).
J. Med. Screen., 18(3):109–111, 2011.
Vivian Lai, Chacha Chen, Q Vera Liao, Alison Smith-Renner, and Chenhao Tan. Towards a
science of Human-AI decision making: A survey of empirical studies. December 2021.
Xiaoxuan Liu, Livia Faes, Aditya U Kale, Siegfried K Wagner, Dun Jack Fu, Alice Bruynseels,
Thushika Mahendiran, Gabriella Moraes, Mohith Shamdas, Christoph Kern, Joseph R
Ledsam, Martin K Schmid, Konstantinos Balaskas, Eric J Topol, Lucas M Bachmann,
Pearse A Keane, and Alastair K Denniston. A comparison of deep learning performance
against health-care professionals in detecting diseases from medical imaging: a systematic
review and meta-analysis. The Lancet Digital Health, 1(6):e271–e297, October 2019.
Robert Mccluskey, A Enshaei, and B A S Hasan. Finding the Ground-Truth from multiple
labellers: Why parameters of the task matter. ArXiv, 2021.
Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert.
In Proceedings of the 37th International Conference on Machine Learning, volume 119 of
Proceedings of Machine Learning Research, pages 7076–7087. PMLR, 2020.
Justin G Norden and Nirav R Shah. What AI in health care can learn from the long road to
autonomous vehicles. NEJM Catalyst Innovations in Care Delivery, 3(2), 2022.
Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects of
generative artificial intelligence. March 2023.
Ziad Obermeyer and Ezekiel J Emanuel. Predicting the future — big data, machine learning,
and clinical medicine. N. Engl. J. Med., 375(13):1216–1219, September 2016.
Serena Pacilè, January Lopez, Pauline Chone, Thomas Bertinotti, Jean Marie Grouin, and
Pierre Fillard. Improving breast cancer detection accuracy of mammography with the
concurrent use of an artificial intelligence tool. Radiol Artif Intell, 2(6):e190208, November
2020.
David M Panicek and Hedvig Hricak. How sure are you, doctor? a standardized lexicon
to describe the radiologist’s level of certainty. AJR Am. J. Roentgenol., 207(1):2–3, July
2016.
Allison Park, Chris Chute, Pranav Rajpurkar, Joe Lou, Robyn L Ball, Katie Shpanskaya,
Rashad Jabarkheel, Lily H Kim, Emily McKenna, Joe Tseng, Jason Ni, Fidaa Wishah,
Fred Wittber, David S Hong, Thomas J Wilson, Safwan Halabi, Sanjay Basu, Bhavik N
Patel, Matthew P Lungren, Andrew Y Ng, and Kristen W Yeom. Deep Learning-Assisted
diagnosis of cerebral aneurysms using the HeadXNet model. JAMA network open, 2(6):
e195600, June 2019.
Bhavik N Patel, Louis Rosenberg, Gregg Willcox, David Baltaxe, Mimi Lyons, Jeremy Irvin,
Pranav Rajpurkar, Timothy Amrhein, Rajan Gupta, Safwan Halabi, Curtis Langlotz, Ed-
ward Lo, Joseph Mammarappallil, A J Mariano, Geoffrey Riley, Jayne Seekins, Luyao
Shen, Evan Zucker, and Matthew Lungren. Human–Machine partnership with artificial
intelligence for chest radiograph diagnosis. npj Digital Medicine, 2(1):111, December 2019.
M Rabin. Inference by believers in the law of small numbers. Q. J. Econ., 2002.
M Rabin and D Vayanos. The gambler’s and hot-hand fallacies: Theory and applications.
Rev. Econ. Stud., 2010.
Matthew Rabin. Incorporating limited rationality into economics. J. Econ. Lit., 51(2):
528–543, June 2013.
Maithra Raghu, Katy Blumer, Greg Corrado, Jon Kleinberg, Ziad Obermeyer, and Sendhil
Mullainathan. The algorithmic automation problem: Prediction, triage, and human effort.
arXiv, March 2019.
Pranav Rajpurkar and Matthew P Lungren. The current and future state of AI interpretation
of medical images. N. Engl. J. Med., 388(21):1981–1990, May 2023.
Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan,
Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P Lungren, and
Andrew Y Ng. CheXNet: Radiologist-Level pneumonia detection on chest X-Rays with
deep learning. (1711.05225), December 2017.
Pranav Rajpurkar, Jeremy Irvin, Robyn L Ball, Kaylie Zhu, Brandon Yang, Hershel Mehta,
Tony Duan, Daisy Ding, Aarti Bagul, Curtis P Langlotz, Bhavik N Patel, Kristen W Yeom,
Katie Shpanskaya, Francis G Blankenberg, Jayne Seekins, Timothy J Amrhein, David A
Mong, Safwan S Halabi, Evan J Zucker, Andrew Y Ng, and Matthew P Lungren. Deep
learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt
algorithm to practicing radiologists. PLoS Med., 15(11):e1002686, November 2018.
Pranav Rajpurkar, Chloe O’Connell, Amit Schechter, Nishit Asnani, Jason Li, Amirhossein
Kiani, Robyn L Ball, Marc Mendelson, Gary Maartens, Daniël J van Hoving, Rulan Griesel,
Andrew Y Ng, Tom H Boyles, and Matthew P Lungren. CheXaid: Deep learning assistance
for physician diagnosis of tuberculosis using chest X-Rays in patients with HIV. npj Digital
Medicine, 3:115, December 2020.
Pranav Rajpurkar, Emma Chen, Oishi Banerjee, and Eric J Topol. AI in health and medicine.
Nat. Med., 28(1):31–38, January 2022.
Carlo Reverberi, Tommaso Rigon, Aldo Solari, Cesare Hassan, Paolo Cherubini, and Andrea
Cherubini. Experimental evidence of effective human–AI collaboration in medical decision-
making. Sci. Rep., 12(1):1–10, September 2022.
Michael Allan Ribers and Hannes Ullrich. Machine predictions and human decisions with
variation in payoff and skills: the case of antibiotic prescribing, 2022.
Jarrel C Y Seah, Cyril H M Tang, Quinlan D Buchlak, Xavier G Holt, Jeffrey B Ward-
man, Anuar Aimoldin, Nazanin Esmaili, Hassan Ahmad, Hung Pham, John F Lambert,
Ben Hachey, Stephen J F Hogg, Benjamin P Johnston, Christine Bennett, Luke Oakden-
Rayner, Peter Brotchie, and Catherine M Jones. Effect of a comprehensive deep-learning
model on the accuracy of chest x-ray interpretation by radiologists: a retrospective, mul-
tireader multicase study. Lancet Digit Health, 3(8):e496–e506, August 2021.
Victor S Sheng, Foster Provost, and Panagiotis G Ipeirotis. Get another label? improving
data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM
SIGKDD international conference on Knowledge discovery and data mining, KDD ’08,
pages 614–622, New York, NY, USA, August 2008. Association for Computing Machinery.
Yongsik Sim, Myung Jin Chung, Elmar Kotter, Sehyo Yune, Myeongchan Kim, Synho Do,
Kyunghwa Han, Hanmyoung Kim, Seungwook Yang, Dong-Jae Lee, and Byoung Wook
Choi. Deep convolutional neural network–based software improves radiologist detection of
malignant lung nodules on chest radiographs. Radiology, 294(1):199–209, January 2020.
Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y Ng, and Matthew P
Lungren. CheXbert: Combining automatic labelers and expert annotations for accurate
radiology report labeling using BERT. April 2020.
Megan Stevenson and Jennifer L Doleac. Algorithmic risk assessment in the hands of humans.
December 2019.
Yasasvi Tadavarthi, Brianna Vey, Elizabeth Krupinski, Adam Prater, Judy Gichoya, Nabile
Safdar, and Hari Trivedi. The state of radiology AI: Considerations for purchase decisions
and current market offerings. Radiol Artif Intell, 2(6):e200004, November 2020.
Philipp Tschandl, Christoph Rinner, Zoe Apalla, Giuseppe Argenziano, Noel Codella, Allan
Halpern, Monika Janda, Aimilios Lallas, Caterina Longo, Josep Malvehy, John Paoli,
Susana Puig, Cliff Rosendahl, H Peter Soyer, Iris Zalaudek, and Harald Kittler. Human–
computer collaboration for skin cancer recognition. Nat. Med., 26(8):1229–1234, June
2020.
A Tversky and D Kahneman. Judgment under uncertainty: Heuristics and biases. Science,
185(4157):1124–1131, September 1974.
Thomas S Wallsten and Adele Diederich. Understanding pooled subjective probability esti-
mates. Math. Soc. Sci., 41(1):1–18, January 2001.
Michael Webb. The impact of artificial intelligence on the labor market. 158713(November),
2019.
S Kevin Zhou, Hayit Greenspan, Christos Davatzikos, James S Duncan, Bram Van Ginneken,
Anant Madabhushi, Jerry L Prince, Daniel Rueckert, and Ronald M Summers. A review
of deep learning in medical imaging: Imaging traits, technology trends, case studies with
progress highlights, and future promises. Proc. IEEE, 109(5):820–838, May 2021.
Appendix
A Appendix of Proofs
$\log \frac{\pi(\omega = 1 \mid s^E)}{\pi(\omega = 0 \mid s^E)}$ completes this case.
Case $d \neq 1$: We analyze this in two subcases.
• $(1 - d)\log c_{rel} > 0$: We show that there exist values of $\log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)}$ and $\log \frac{\pi(s^A \mid \omega=1, s^E)}{\pi(s^A \mid \omega=0, s^E)}$ such that $a^*(s^A, s^E; p) = 0$, $a^*(s^E; p) = 1$, and $a^*(s^A, s^E; \pi) = 1$. Equivalently, we need to find values such that $\log c_{rel} > b \log \frac{\pi(s^A \mid \omega=1, s^E)}{\pi(s^A \mid \omega=0, s^E)} + d \log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)}$, $\log c_{rel} < \log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)}$, and $\log c_{rel} < \log \frac{\pi(s^A \mid \omega=1, s^E)}{\pi(s^A \mid \omega=0, s^E)} + \log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)}$ if $d \neq 1$. Rewriting this system with $y = \log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)} - \log c_{rel}$ and $x = \log \frac{\pi(s^A \mid \omega=1, s^E)}{\pi(s^A \mid \omega=0, s^E)}$, we need to find a solution to the system $y > 0$, $x + y > 0$, and $bx + dy < (1 - d)\log c_{rel}$. Since $(1 - d)\log c_{rel} > 0$, there exist small enough values of $x, y > 0$ such that a solution exists.
• $(1 - d)\log c_{rel} < 0$: An argument analogous to that of the first subcase shows that there exist values of $\log \frac{\pi(\omega=1 \mid s^E)}{\pi(\omega=0 \mid s^E)}$ and $\log \frac{\pi(s^A \mid \omega=1, s^E)}{\pi(s^A \mid \omega=0, s^E)}$ such that $a^*(s^A, s^E; p) = 1$, $a^*(s^E; p) = 0$, and $a^*(s^A, s^E; \pi) = 0$.
B.1 Design
[Figure: Design 1 track structure. Columns: the four information environments (XO, CH, AI, AI + CH); rows: Tracks 1 through 24, each beginning with a different environment for Set 1 and randomly assigned to the remaining treatments and sets (Sets 1–4). Each track comprises 60 reads, with 15 reads per set.]
Note: In this design, radiologists are assigned to a randomized sequence of the four information environments, resulting in twenty-four possible tracks. Under each information environment they read fifteen cases. Radiologists encounter each patient case at most once. At the beginning of the experiment every radiologist reads eight practice cases. Furthermore, a random half of the participating radiologists receive incentives for accuracy.
Note: In this design, radiologists diagnose sixty patient cases each under the four information environments. Radiologists read every case under every information environment across four sessions, separated by a washout period. Each case is encountered only once per session, and to ensure that radiologists do not recall their own or the AI's predictions from previous reads of the same cases, we impose a minimum two-week washout period between subsequent sessions. Within every experimental session they therefore read fifteen cases under each information environment. The randomization occurs at the track level, where every track has a different sequence of the information environments. (Example tracks shown here.)
[Design 3 schematic: two example tracks (XO, CH, AI, AI + CH and CH, XO, AI + CH, AI over Sets 1 and 2); each set is read first without and then with AI assistance, so the AI and no-AI halves use the same sets.]
Note: In this design, radiologists diagnose fifty cases, first without and then with AI assistance. Clinical history is randomly provided in either the first or the second half of the images, which forms the basis of the randomization. The cases diagnosed with and without clinical history are different.
B.2 Instructions
Below you will find the instructions that subjects received in the interface-based treatment. Comments on the instructions are provided in italics and were not seen by subjects.
Instructions
You are about to participate in a study on medical decision making. You may pause the
study at any time. To resume, revisit the link you were given and your progress will have
been saved.
We will present you with adult patients with potential thoracic pathologies. These patients
will be presented under the following four scenarios:
1. Only an X-ray is shown.
2. An X-ray is shown along with additional information on clinical history.
3. An X-ray is shown along with Artificial Intelligence (AI) support. This AI tool is
described in further detail below.
4. An X-ray is shown along with both additional information on clinical history and the
AI support.
The patients are randomly assigned to each of these scenarios. That is, availability of clinical
history and/or AI support is unrelated to the patient.
Clinical History: includes available lab results or indications by the treating physician, if any.
AI support: This tool uses only the X-ray image to predict the probability of each potential
pathology of interest. The tool is based on state-of-the-art machine learning algorithms
developed by a leading team of researchers at Stanford University.
Responses
For each patient and pathology, we will ask for both an assessment and a treatment decision:
1. We will first ask for your assessment of the probability that each condition is present in
a patient. Please consider all pathologies and findings that would be relevant
in a radiology report for the patient. You should express your uncertainty
about the presence of one or many conditions by appropriately choosing the
probability. Note that it is possible that the patient has multiple such conditions or
none of them.
2. If you determine that a pathology may be present, we may ask you to rate the severity
and/or extent of the disease on a scale.
3. Finally, when relevant we will ask whether you would recommend treatment or follow
up according to the clinical standard of care if you determine that the pathology may
be present. The first two responses are diagnostic while the third is a clinical decision.
We are aware that a single physician or radiologist typically does not perform both
tasks. However, for this study, we ask that you respond to the best of your ability in
both of these roles.
Browser Compatibility
This platform supports desktop versions of Chrome, Firefox, and Edge. Important features are missing on unsupported browsers (including Safari), and we discourage their use for this experiment. In addition, the platform does not support mobile devices and will perform poorly on mobile. If you encounter any issues during the experiment, please send an email to [email protected] and we will follow up quickly.
Hierarchy
The interface uses a hierarchy to categorize various thoracic conditions. It will be useful to
familiarize yourself with this hierarchy before you start, but you may also revisit the hierarchy
at any time throughout the experiment by clicking the help tab in the upper right corner.
[The probability for the sub-pathologies is required only if the parent pathology prevalence is
greater than 10%]
AI Support Tool
The AI support tool that is provided uses only the X-ray image to predict the probability of
each potential pathology of interest. The tool is based on state-of-the-art machine learning
algorithms developed by a leading team of researchers at Stanford University. The tool is
trained only on X-ray images, meaning it does not incorporate the clinical history of the
patients.
Performance of the AI Support
The AI tool is described in Irvin et al. [2019], which showed the AI tool performed at or near
expert levels across the pathologies studied. Below we plot two measures of performance of
the AI tool. We plot in blue the accuracy of the tool, defined as the share of cases correctly
diagnosed when treating false positives and false negatives equally. In red, we plot the Area
Under the ROC curve (AUC), which is another measure of AI classification performance.
The AUC is a number between 0 and 100%, with numbers close to 100% representing better
algorithm performance. The AUC is equal to the probability that a randomly chosen positive
case is ranked higher than a randomly chosen negative case.
Example Images
Below are 50 example images with the associated AI tool predictions. These images are
randomly chosen to allow you to familiarize yourself with the AI support tool and its accuracy
[Here we only provide two out of the fifty images].
Demonstration
The brief video below walks you through the interface and a few examples. [At this stage
participants saw an instruction video which can be found here]
Consent
You have been asked to participate in a study conducted by researchers from the Mas-
sachusetts Institute of Technology (M.I.T.) and Harvard University.
The information below provides a summary of the research. Your participation in this re-
search is voluntary and you can withdraw at any time.
1. Study procedure: We will ask you to examine a number of chest x-rays. We will vary
both the amount of information provided about the patient and the availability of an
AI support tool.
2. Potential Risks & Benefits: There are no foreseeable risks associated with this study
and you will receive no direct benefit from participating.
Your participation in this study is completely voluntary and you are free to choose whether
to be in it or not. If you choose to be in this study, you may subsequently withdraw from
it at any time without penalty or consequences of any kind. The investigator may withdraw
you from this research if circumstances arise.
Privacy & Confidentiality
The only people who will know that you are a research subject are members of the research
team which might include outside collaborators not affiliated with MIT. No identifiable infor-
mation about you, or provided by you during the research, will be disclosed to others without
your written permission, except: if necessary to protect your rights or welfare, or if required
by law. In addition, your information may be reviewed by authorized MIT representatives
to ensure compliance with MIT policies and procedures.
When the results of the research are published or discussed in conferences, no information
will be included that would reveal your identity.
Questions
If you have any questions or concerns about the research, please feel free to contact us directly
at [email protected].
Your Rights
You are not waiving any legal claims, rights or remedies because of your participation in this
research study. If you feel you have been treated unfairly, or you have questions regarding
your rights as a research subject, you may contact the Chairman of the Committee on the
Use of Humans as Experimental Subjects, M.I.T., Room E25-143B, 77 Massachusetts Ave,
Cambridge, MA 02139, phone 1-617-253 6787.
I understand the procedures described above. By clicking next, I am acknowledging my
questions have been answered to my satisfaction, and I agree to participate in this study.
Interface questions
[Each of these questions has a true or false response which was entered through a radio button. Participants are not able to start the experiment without answering each question correctly.]
Before beginning the experiment, we would like to confirm a few facts through the following
comprehension questions. Please answer True or False to the following questions.
1) The algorithm’s prediction is based on information from both the X-ray scan as well as
the clinical history.
2) When the algorithm does not show a prediction, it is because the algorithm thinks the
pathology is not present.
3) The follow up decision refers to any treatment or additional diagnostic procedures that
one would conduct based on the findings of the report.
4) Two patients with the same probability score for a condition ought to always receive the
same “follow-up” recommendation.
5) When a condition at a higher level of the hierarchy receives a less than ten percent chance
of being present then all the lower level conditions within this branch automatically receive
a zero probability of being present.
6) If the algorithm says that a pathology is present with 80% probability, it means that the AI predicts 80 cases out of 100 have the pathology present.
7) Suppose your assessment is that the patient definitely has either edema or consolidation,
and you believe that edema is twice as likely as consolidation. Then you would assign 66.67%
to edema and 33.33% to consolidation.
8) I should only indicate pathologies and findings that would be relevant in a radiology report
for the patient.
Interface
Figure B.7 is an example of the clinical history indications available to the participating radiologists under the relevant treatment condition. The thoroughness of the available information varies across patients. Some examples of varying clinical history information are:
3. 55 years of age, Male, Order History: Relevant PMH gastroparesis. Presents with vom-
iting, retching chest discomfort for a duration of today. Concern for PTX, perforated
viscus, pneumomediastinum.
4. 74 years of age, Female, s/p unwitnessed fall, r/o rib fx, pna or effusion
5. Trauma
6. 56 years of age, Male, S/P ICD/ Pacemaker insertion / Complete X-ray without lifting arms above shoulders.
Note: The clinical history information environment in the experiment had information on patient indications, vitals and abnormal
labs.
Note: The participants use the slider to indicate the probability of a pathology being present for a given patient based on the
treatment offered. For prevalence greater than 10% the participants are required to indicate the prevalence of a sub-pathology
(if it exists) and whether a follow-up is recommended.
The training data is a set of tuples of images and labels. These training datasets typically rely on human input to assign the labels, which indicate whether or not a specific pattern or object is present in the image. Training is conducted through stochastic gradient descent. These algorithms build on the nested structure of the neural net to compute gradients efficiently via the chain rule. Each training step is performed on a small batch of data so that the algorithm does not have to consider the entire dataset for each optimization step. After each round of optimization on the training set, the model performance is assessed through predictions on a hold-out validation sample. Most humans are able to recognize cars, pedestrians, and traffic lights, which means that training datasets for common classification tasks are easy to come by. The same is not true for medical imaging. Classifying disease based on X-rays, CT scans, and retina scans requires the input of highly trained experts. Recently, several researchers have released large training datasets of medical images with disease labels that are extracted from written clinical descriptions (Irvin et al., 2019). The neural net has a DenseNet121 architecture. A DenseNet is a type of convolutional neural network that utilizes dense connections between layers through Dense Blocks; in these blocks, all layers with matching feature-map sizes are connected directly with each other. Images are supplied in a standardized format of 320 × 320 pixels. For optimization the researchers use the Adam optimizer with default β-parameters of β1 = 0.9, β2 = 0.999 and a learning rate of 1 × 10−4. The batch size is fixed at 16 images. The training is performed for 3 epochs. The full training procedure is described in Irvin et al. (2019).
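For concreteness, the following is a minimal PyTorch sketch of such a training loop, using randomly generated stand-in data rather than the CheXpert images; it only illustrates the architecture and hyperparameters described above (DenseNet121, Adam with β1 = 0.9, β2 = 0.999, learning rate 1 × 10−4, batch size 16, 3 epochs, validation after each epoch) and is not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import densenet121

# Stand-in data: 320 x 320 images with one binary label per pathology (multi-label task).
# Real training would use the labeled chest X-rays instead of random tensors.
NUM_PATHOLOGIES = 14
train_set = TensorDataset(torch.randn(32, 3, 320, 320),
                          torch.randint(0, 2, (32, NUM_PATHOLOGIES)).float())
valid_set = TensorDataset(torch.randn(16, 3, 320, 320),
                          torch.randint(0, 2, (16, NUM_PATHOLOGIES)).float())
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)   # batch size fixed at 16
valid_loader = DataLoader(valid_set, batch_size=16)

model = densenet121(weights=None, num_classes=NUM_PATHOLOGIES)      # DenseNet121 architecture
criterion = nn.BCEWithLogitsLoss()                                  # one sigmoid output per pathology
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

for epoch in range(3):                                              # training runs for 3 epochs
    model.train()
    for images, labels in train_loader:                             # one stochastic gradient step per mini-batch
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                                             # gradients computed via the chain rule
        optimizer.step()
    model.eval()
    with torch.no_grad():                                           # assess on the hold-out validation sample
        val_loss = sum(criterion(model(x), y).item() for x, y in valid_loader) / len(valid_loader)
    print(f"epoch {epoch + 1}: validation loss = {val_loss:.4f}")
```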
C Data Appendix
We verify that the randomization occurred as expected through various balance and ran-
domization tests. Figure C.9 plots the distribution of treatment probabilities by patient-case
in Design 1. We also plot a placebo distribution that samples from the null distribution to
support the claim that the randomization occurred as expected. To test this formally, we
present balance tests for Design 1 and Design 2 in Table C.1 and Table C.2, respectively.32
For these balance tests, we calculate the average covariates across the four treatment arms
and report p-values from the test of the joint null that the four means are equal. For Design
2, these are done within sessions as patients are balanced by design across all sessions.
32 Design 3 is balanced by design, as each radiologist reads the same cases with and without AI assistance.
[Figure C.9: cumulative distribution functions of treatment probabilities. Legend: Treatment vs. Placebo for the groups IN_AI, IN_NAI, NIN_AI, NIN_NAI; y-axis: proportion; x-axis: share of cases with treatment.]
Note: The cumulative distribution functions of patient treatment probabilities by treatment for design 1. The placebo distri-
bution is calculated based on 100,000 draws from the null distribution. For each draw from the null distribution, we sample
the number of reads the case receives from the empirical distribution and then draw the number of treatments from a binomial
distribution with probability 1/4.
Table C.1: Covariate Balance in Design 1
Control CH AI AI x CH p-value
sA 0.308 0.299 0.308 0.307 0.410
Airspace Opacity 0.161 0.142 0.164 0.163 0.222
Cardiomediastinal Abnormality 0.130 0.135 0.132 0.131 0.985
Support Device Hardware 0.173 0.161 0.188 0.195 0.053
Abnormal 0.183 0.179 0.192 0.191 0.744
Weight 185.10 186.67 185.46 184.79 0.656
Temp 99.01 99.04 99.05 99.07 0.057
Pulse 92.11 92.68 92.28 93.02 0.012
Age 56.50 56.15 56.31 56.70 0.882
Number Labs 34.59 34.46 34.50 34.27 0.757
Number Flagged Labs 5.863 5.914 6.060 5.982 0.646
Female 0.424 0.412 0.385 0.392 0.080
Note: Balance tests of patient covariates for patients assigned to the four treatments in Design 1. Missing clinical history variables are mean-imputed. The p-values come from the joint test that the mean covariates are equal across the four treatments.
Table C.2: Covariate Balance in Design 2
C.2 Ground Truth Quality
Here, we summarize evidence that the ground truth measure we construct is high quality
and robust to various decisions an analyst could make. Recall that the preferred ground
truth used throughout the paper is defined using the reads of five board-certified radiologists
from Mount Sinai, who each read all 324 patient cases in the study. For each pathology, we
aggregate these reports into the ground truth for a patient case i as
$$\omega_i = \mathbb{1}\left[\frac{1}{5}\sum_{r=1}^{5}\pi_r\!\left(\omega_i = 1 \mid s^E_{i,r}\right) > \frac{1}{2}\right]$$
where we suppress the pathology index for simplicity and r indexes the radiologist. This
method of aggregating reports is robust to certain types of measurement error and dependence
across reports as discussed in Wallsten and Diederich (2001). Table C.3 contains summary
statistics for the ground truth created using the Mount Sinai radiologists and a leave-one-
out internal ground truth calculated using the reads collected during the experiment under
the treatment arm with clinical history but no AI assistance. Table C.4 contains additional
summary statistics for the five Mount Sinai ground truth labelers, including their average
time and number of clicks. We also show the average agreement of the labels with the original
radiologist’s read. Taken together, these analyses demonstrate, for the majority of cases, that
the ground truth labelers agree with the assessment of the radiologist who originally read the
report in a clinical setting and we can reject that the average probability assessment is equal
to 0.5 at the 5% level. Moreover, in Section C.4.2 we show that our results are robust to many
different methods of calculating ground truth, including using the experiment leave-one-out
ground truth and various aggregation methods of the Mount Sinai reports.
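A minimal numpy sketch of this aggregation rule, with a hypothetical probs array holding the five labelers' probability reports for a handful of cases:

```python
import numpy as np

# probs[i, r]: labeler r's reported probability that case i is positive
# for a given pathology (hypothetical example values).
probs = np.array([
    [0.9, 0.8, 0.7, 0.6, 0.9],   # case 0: clearly positive
    [0.1, 0.2, 0.4, 0.1, 0.0],   # case 1: clearly negative
    [0.6, 0.4, 0.7, 0.3, 0.5],   # case 2: borderline
])

# Ground truth: indicator that the mean of the five probability reports exceeds 1/2.
ground_truth = (probs.mean(axis=1) > 0.5).astype(int)
print(ground_truth)  # [1 0 0]
```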
Table C.3: Ground Truth Quality
This section presents distributions for two different accuracy measures for radiologists and
the AI across different pathology groups. These figures allow for a comparison between the
accuracy of the AI relative to the mean radiologist.
[Figure: distributions of radiologist AUROC by pathology. Panels: Abnormal, Airspace opacity, Atelectasis, Bacterial pneumonia/lobar pneumonia, Cardiomediastinal abnormality, Cardiomegaly, Consolidation, Edema, Fracture (ribs), Pleural effusion.]
Note: This figure summarizes the distribution of radiologist AUROCs across different pathologies, as well as the AUROC of
the AI algorithm for the corresponding pathology. Only the cases where contextual history information is available for the
radiologist but not the AI prediction were considered.
[Figure: distributions of radiologist RMSE by pathology. Panels: Abnormal, Airspace opacity, Atelectasis, Bacterial pneumonia/lobar pneumonia.]
Note: This figure summarizes the distribution of radiologist RMSE across different pathologies, as well as the RMSE of the AI
algorithm for the corresponding pathology. Only the cases where contextual history information is available for the radiologist
but not the AI prediction were considered.
The reports from the radiologists who originally read the patient cases included in our sample
were classified as positive/negative/uncertain for each pathology using AI predictions gener-
ated by the CheXbert algorithm described in Smit et al. (2020). We compare the accuracy of the original reads, relative to the ground truth, with that of the radiologists in our sample under the treatment arm with clinical history and no AI assistance. We do this for each pathology by
converting the probability reports elicited during the experiment to positive/negative assess-
ments, where positive is defined as having a probability greater than 50%. We convert the
CheXbert labels to positive/negative assessments by including the uncertain cases as posi-
tive.33 We then calculate the accuracy of the experiment reads and the CheXbert labels for
groups of pathologies focused on in this study and test the null hypothesis that the accuracy
of the radiologists is the same. The results of this analysis are in Table C.5, and those for
when treating uncertain cases negative are in Table C.6.
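As an illustration of this comparison, the sketch below binarizes simulated reads with hypothetical column names, counts uncertain CheXbert labels as positive, and tests equality of accuracy with standard errors clustered at the patient-case level.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated reads with hypothetical column names: one row per (patient case, read).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "patient_id": rng.integers(0, 50, n),
    "experiment": rng.integers(0, 2, n),                 # 1 = experiment read, 0 = original (CheXbert-labeled) read
    "prob":       rng.uniform(0, 1, n),                  # elicited probability for experiment reads
    "chexbert":   rng.choice(["pos", "neg", "unc"], n),  # CheXbert label for original reads
    "truth":      rng.integers(0, 2, n),                 # ground truth
})

# Binarize: experiment reads at a 50% threshold; CheXbert labels with "uncertain" counted as positive.
df["assessment"] = np.where(df["experiment"] == 1,
                            (df["prob"] > 0.5).astype(int),
                            df["chexbert"].isin(["pos", "unc"]).astype(int))
df["correct"] = (df["assessment"] == df["truth"]).astype(int)

# Test whether accuracy differs between experiment and original reads,
# clustering standard errors at the patient-case level.
fit = smf.ols("correct ~ experiment", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["patient_id"]})
print(fit.summary().tables[1])
```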
Table C.5
                 Top-Level with AI   Pooled with AI   Abnormal
                        (1)               (2)            (3)
Experiment            -0.004            -0.005         -0.086
                      (0.016)           (0.006)        (0.025)
Constant               0.194             0.090          0.466
                      (0.016)           (0.006)        (0.028)
Observations            9718             53449           4859
R-Squared               0.000             0.000          0.002
Note: Regression of indicator equal to one if binarized assessment is equal to ground truth from both the original reads and
experiment reads onto a constant and an indicator equal to one if the radiologist was in the experiment. Standard errors are
clustered at the patient-case level.
33 For all pathologies but bacterial pneumonia and atelectasis, fewer than 5% of patients have uncertain cases. For abnormal and all of the top-level pathologies with AI, there are no cases with uncertain labels.
Table C.6
                 Top-Level with AI   Pooled with AI   Abnormal
                        (1)               (2)            (3)
Experiment            -0.004             0.020         -0.086
                      (0.016)           (0.005)        (0.025)
Constant               0.194             0.065          0.466
                      (0.016)           (0.005)        (0.028)
Observations            9718             53449           4859
R-Squared               0.000             0.000          0.002
Note: Regression of indicator equal to one if binarized assessment is equal to ground truth from both the original reads and
experiment reads onto a constant and an indicator equal to one if the radiologist was in the experiment. Standard errors are
clustered at the patient-case level.
C.4 Robustness
In this section, we show the robustness of the results from Section 4.2.
C.4.1 By Design
Design 1
This section presents summaries of radiologist accuracy and treatment effect estimates using data from Design 1. We do not present treatment effects conditional on the radiologist's prediction, as the same cases are not read under multiple information environments in this design.
[Figure: distributions of radiologist RMSE and AUROC.]
Note: Main specifications similar to figure 1 with the exception that observations are from design 1 only.
[Figure: treatment effects by AI prediction bins.]
Note: Main specifications similar to figure 3 with the exception that observations are from design 1 only.
Design 2
This section presents summaries of key variables, radiologist accuracy and treatment effect
estimates using data from design 2.
[Table of key variables (Mean, SD); figure: distributions of radiologist RMSE and AUROC.]
Note: Main specifications similar to figure 1 with the exception that observations are from design 2 only.
[Figure: treatment effects by radiologist probability bins without AI.]
Note: Main specifications similar to figure 2 with the exception that observations are from design 2 only.
[Figure: ATE on deviation from ground truth, by AI prediction bins.]
Note: Main specifications similar to figure 3 with the exception that observations are from design 2 only.
Note: Main specifications similar to table 2 with the exception that observations are from design 2 only.
Design 3
This section presents summaries of key variables, radiologist accuracy and treatment effect
estimates using data from design 3.
[Table of key variables (Mean, SD); figure: distributions of radiologist RMSE and AUROC.]
Note: Main specifications similar to figure 1 with the exception that observations are from design 3 only.
[Figure: treatment effects by radiologist probability bins without AI.]
Note: Main specifications similar to figure 2 with the exception that observations are from design 3 only.
[Figure: treatment effects by AI prediction bins.]
Note: Main specifications similar to figure 3 with the exception that observations are from design 3 only.
Note: Main specifications similar to table 2 with the exception that observations are from design 3 only.
This section computes the main results using a ground truth constructed using a leave-one-
out average of assessments by radiologists participating in the experiment in the treatment
arm with clinical history but no AI assistance. Specifically, for each radiologist $r$ and patient case $i$ we construct $\omega_{ir} = \mathbb{1}\left[\frac{1}{N_i - 1}\sum_{r' \neq r}\pi_{r'}\!\left(\omega_i = 1 \mid s^E_{ir'}\right) > 0.5\right]$.
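A small pandas sketch of this leave-one-out construction, with hypothetical column names:

```python
import pandas as pd

# One row per (patient case, radiologist) read in the CH / no-AI arm; hypothetical example values.
reads = pd.DataFrame({
    "case":        [1, 1, 1, 2, 2, 2],
    "radiologist": ["a", "b", "c", "a", "b", "c"],
    "prob":        [0.8, 0.6, 0.4, 0.2, 0.1, 0.9],
})

# Leave-one-out mean: (sum of the other radiologists' probabilities) / (N_i - 1), then threshold at 0.5.
grp = reads.groupby("case")["prob"]
loo_mean = (grp.transform("sum") - reads["prob"]) / (grp.transform("count") - 1)
reads["loo_ground_truth"] = (loo_mean > 0.5).astype(int)
print(reads)
```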
[Figure: distributions of radiologist RMSE and AUROC.]
Note: Main specifications similar to figure 1 with the exception that ground truth is constructed using experiment leave-one-out
average.
[Figure: ATE of AI on incorrect decision, by radiologist probability bins without AI.]
Note: Main specifications similar to figure 2 with the exception that ground truth is constructed using experiment leave-one-out
average.
[Figure: treatment effects by AI prediction bins.]
Note: Main specifications similar to figure 3 with the exception that ground truth is constructed using experiment leave-one-out
average.
Note: Main specifications similar to table 2 with the exception that ground truth is constructed using experiment leave-one-out
average.
This section computes the main results using a continuous ground truth constructed using a
simple average of the ground truth labelers' probability assessments.
[Figure: distribution of radiologist RMSE; legend: AI (Pct. 2).]
Note: Main specifications similar to figure 1 with the exception that ground truth is constructed using continuous values.
[Figure: treatment effects by radiologist probability bins without AI.]
Note: Main specifications similar to figure 2 with the exception that ground truth is constructed using continuous values.
[Figure: treatment effects by AI prediction bins.]
Note: Main specifications similar to figure 3 with the exception that ground truth is constructed using continuous values.
Note: Main specifications similar to table 2 with the exception that ground truth is constructed using continuous values.
The following graphs contain only those cases from the treatment group that the subjects encountered first. This includes the first 15 reads from Design 1 and the first 5 reads from Design 2. This exercise checks whether the treatment effects for all reads differ from those for the first reads.
[Figure: distributions of radiologist RMSE and AUROC.]
Note: Main specifications similar to figure 1 with the exception that observations are from the first treatment received in designs
1 and 2 only.
[Figure: treatment effects by AI prediction bins.]
Note: Main specifications similar to figure 3 with the exception that observations are from the first treatment received in designs
1 and 2 only.
Note: Main specifications similar to table 2 with the exception that observations are from the first treatment received in designs
1 and 2 only.
[Figure: ATE on deviation from AI in Session 2, lagged vs. contemporaneous effects.]
Note: This graph shows the lagged effect: the treatment effect on the deviation from the AI prediction in the second session of having received the AI signal in the first session, conditional on receiving the AI signal in the second session. The contemporaneous effect, by contrast, shows the deviation from the AI prediction in the second session from receiving the AI signal in that session. These graphs are valid only for Design 2, as participants see the same image in different information environments.
C.4.4 Incentives
This section tests whether incentives for assessment accuracy lead radiologists to make more accurate assessments. We find that incentives do not play a significant role in producing a correct response. The effects of incentives are estimated using the following regression specification, and the results are shown in Table C.14,
where $Y_{irt}$ is an outcome variable of interest for radiologist $r$ diagnosing patient case-pathology $i$ under treatment $t$, and $\gamma_{h_i}$ are pathology fixed effects. Here CH refers to cases with access to clinical history information, AI to cases with AI predictions, and INC refers to incentivized cases.
Table C.14: Effect of Incentives
Note: This table summarizes the average treatment effects (ATE) of different information environments on the absolute value of the difference between the radiologist probability and the AI algorithm probability (Columns (1) and (2)), the absolute value of the difference between the radiologist probability and the ground truth (Columns (3) and (4)), and radiologists' effort measured in terms of active time and clicks for all and the second-half images (Columns (5) and (6)). The F-statistic tests for the joint significance of the four incentivized groups. The Top-Level specification includes two pathologies, airspace opacity and cardiomediastinal abnormality, while Pooled AI includes all the pathologies with AI predictions excluding abnormality and support device hardware. Only cases in Design 1 and Design 3 are considered. Two-way clustered standard errors at the radiologist and patient-case level are in parentheses.
Figure C.29 uses the following specification, which controls for the sequence number in which the participants saw a particular case within an experiment session and for session dummies across the different designs and experiment sessions, to estimate the heterogeneous treatment effects. There are four sessions in Design 2, whereas Designs 1 and 3 have only one session.
$$Y_{irt} = \gamma_{h_i} + \gamma_{AI}\, d_{AI}(t) + \sum_{g}\left[\gamma_g\, d_g(s^A_i) + \gamma_{AI\times g}\, d_{AI}(t)\, d_g(s^A_i)\right] + \gamma_{w_{irt}} + \gamma_{m_{irt}} + \varepsilon_{irt}$$
where $Y_{irt}$ is an outcome variable of interest for radiologist $r$ diagnosing patient case-pathology $i$ under treatment $t$, $\gamma_{h_i}$ are pathology fixed effects, $\gamma_{w_{irt}}$ are sequence-number dummies, and $\gamma_{m_{irt}}$ are session dummies. Here, $g$ indexes one of five intervals of the AI signal range ($1 \leq g \leq 5$); a case falls in interval $g$ according to the AI signal value for the given patient case.
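A rough statsmodels sketch of this binned specification on simulated data, with hypothetical column names; for brevity it clusters only at the radiologist level rather than two-way.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated reads with hypothetical column names.
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "y":           rng.normal(size=n),                  # outcome, e.g. deviation from ground truth
    "ai_shown":    rng.integers(0, 2, n),                # d_AI(t): AI prediction shown to the radiologist
    "ai_signal":   rng.uniform(0, 1, n),                 # s_i^A: the AI probability for the case
    "pathology":   rng.integers(0, 3, n),
    "seq":         rng.integers(1, 16, n),               # order of the case within a session
    "session":     rng.integers(1, 5, n),
    "radiologist": rng.integers(0, 30, n),
})
# g: five intervals of the AI signal range, [0, 0.2), ..., [0.8, 1].
df["g"] = pd.cut(df["ai_signal"], bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0], include_lowest=True).astype(str)

# Pathology, bin, sequence-number and session fixed effects; the AI-shown x bin interactions
# give one treatment-effect estimate per AI prediction bin.
fit = smf.ols("y ~ C(pathology) + C(g) + ai_shown:C(g) + C(seq) + C(session)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["radiologist"]})
print(fit.params.filter(like="ai_shown"))
```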
[Figure: treatment effects by AI prediction bins.]
Note: Main specifications similar to figure 3 with additional controls for rounds and session.
[Figure: distributions of radiologist RMSE and AUROC.]
Note: Main specifications similar to figure 1 with the exception that reported probability of the radiologists is calibrated to the
ground truth.
[Figure: ATE on deviation from ground truth, by radiologist probability bins without AI and by AI prediction bins.]
Note: Main specifications similar to figure 2 and figure 3 with the exception that reported probability of the radiologists is
calibrated to the ground truth and there are no separate results for ATE on incorrect decision.
                     Deviation from GT
Treatment                   (1)
AI × CH                    0.000
                          (0.004)
AI                        −0.001
                          (0.003)
CH                        −0.002
                          (0.003)
Control Mean               0.184
                          (0.010)
Pathology FE                Yes
Observations               36279
Note: Main specifications similar to table 2 with the exception that reported probability of the radiologists is calibrated to the
ground truth and hence only the ATE on deviation from ground truth is reported.
[Figure: distributions of radiologist RMSE and AUROC.]
Note: Main specifications similar to figure 1 with the exception that all pathologies with AI are considered.
[Figure: treatment effects by radiologist probability bins without AI.]
Note: Main specifications similar to figure 3 with the exception that all pathologies with AI are considered.
[Figure: treatment effects by AI prediction bins.]
Note: Main specifications similar to figure 3 with the exception that all pathologies with AI are considered.
[Figure: distributions of radiologist RMSE and AUROC (Abnormal).]
Note: Main specifications similar to figure 1 with the exception that only the abnormal pathology is considered.
[Figure: ATE of AI on incorrect decision, by radiologist probability bins without AI.]
Note: Main specifications similar to figure 3 with the exception that only the abnormal pathology is considered.
[Figure: treatment effects by AI prediction bins.]
Note: Main specifications similar to figure 3 with the exception that only the abnormal pathology is considered.
Table C.17: Average Treatment Effects
Note: Main specifications similar to table 2 with the exception that only the abnormal pathology is considered.
Table C.18 presents evidence that human and AI signals are not conditionally independent. To test the hypothesis of conditional independence, we regress the human report in the treatment arm without AI assistance on the ground truth, the AI score, and the interaction between the ground truth and the AI score. If the signals were conditionally independent, the AI score would offer no predictive power for the human report after conditioning on the ground truth. As shown in Table C.18, we can reject this null hypothesis.
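A stylized version of this test on simulated data (column names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated no-AI reads with hypothetical column names: the human report, the ground truth,
# the AI score, and a patient-case identifier for clustering.
rng = np.random.default_rng(2)
n = 1000
truth = rng.integers(0, 2, n)
ai_score = np.clip(0.6 * truth + rng.normal(0.2, 0.2, n), 0, 1)
human_prob = np.clip(0.5 * truth + 0.3 * ai_score + rng.normal(0.0, 0.2, n), 0, 1)
df = pd.DataFrame({"human_prob": human_prob, "truth": truth,
                   "ai_score": ai_score, "patient_id": rng.integers(0, 100, n)})

# Regress the no-AI human report on the ground truth, the AI score, and their interaction.
fit = smf.ols("human_prob ~ truth * ai_score", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["patient_id"]})

# Under conditional independence of the two signals, the AI score and its interaction with the
# ground truth should have no predictive power once the ground truth is conditioned on.
print(fit.f_test("ai_score = 0, truth:ai_score = 0"))
```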
Here, we describe the method we use to estimate the Bayesian benchmark $\pi(\omega_i = 1 \mid s^E_{ir}, s^A_i)$. This procedure is done separately for each pathology. We train a random forest classifier that predicts the ground truth based on features including the vector of a radiologist's reported probabilities in the non-AI treatment and the vector of AI predictions. Additional features include radiologist identifiers to allow for heterogeneity in radiologists' assessments, an indicator equal to one if the case was read with clinical history, and summaries of the patient clinical history. We estimate this quantity for various parameterizations of $s^E_{ir}$ and $s^A_i$ described in Section 5. These are used in the model testing exercise to understand whether radiologists account for the joint distribution of signals when forming their posterior beliefs. The hyperparameters of the model are tuned using five-fold grouped cross-validation, with observations grouped by patient id to avoid overfitting. We impose monotonicity constraints on the model so that $\pi(\omega_i = 1 \mid s^E_{ir}, s^A_i)$ is monotonically increasing in all probability inputs. When the model includes clinical history, we provide a summary of the patient's clinical record with their sex, weight, temperature, pulse, age, and the number of available and flagged labs. We mean-impute these variables when the radiologist does not have access to the clinical history and include an indicator equal to one if the radiologist had access to the clinical history as an additional feature. Below we summarize the performance of these models and the relative value of increasing the dimension of $s^E_{ir}$ and $s^A_i$.
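A simplified scikit-learn sketch of this estimation with hypothetical features and simulated data; the paper's monotonicity constraints are omitted here (newer scikit-learn versions expose a monotonic_cst option on tree-based models):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Simulated data with hypothetical feature names: the radiologist's no-AI report, the AI score,
# clinical-history summaries, and a radiologist identifier.
rng = np.random.default_rng(3)
n = 2000
X = pd.DataFrame({
    "human_prob_no_ai": rng.uniform(0, 1, n),
    "ai_score":         rng.uniform(0, 1, n),
    "has_history":      rng.integers(0, 2, n),
    "age":              rng.normal(56, 15, n),
    "radiologist_id":   rng.integers(0, 30, n),
})
y = rng.integers(0, 2, n)               # ground truth label
patient_id = rng.integers(0, 300, n)    # used to group folds by patient

# Tune hyperparameters with five-fold cross-validation grouped by patient id,
# so reads of the same patient never appear in both the training and validation folds.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [200, 500], "min_samples_leaf": [5, 20]},
    cv=GroupKFold(n_splits=5),
    scoring="roc_auc",
)
search.fit(X, y, groups=patient_id)

# Estimated Bayesian benchmark: the predicted probability that omega_i = 1 given the signals.
benchmark = search.best_estimator_.predict_proba(X)[:, 1]
```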
Table C.19: Summary of Bayesian Models
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
Accuracy 0.88 0.91 0.88 0.89 0.90 0.90 0.89 0.90 0.89 0.89 0.91 0.90
Airspace Opacity
AUC 0.89 0.96 0.91 0.92 0.96 0.96 0.94 0.96 0.93 0.93 0.96 0.96
Accuracy 0.90 0.92 0.90 0.91 0.93 0.92 0.92 0.92 0.91 0.91 0.93 0.93
Cardiomediastinal Abnorm.
AUC 0.90 0.96 0.91 0.93 0.96 0.96 0.94 0.96 0.94 0.94 0.96 0.96
Accuracy 0.88 0.91 0.88 0.89 0.91 0.91 0.90 0.91 0.90 0.90 0.92 0.91
Abnormal
AUC 0.91 0.96 0.91 0.93 0.96 0.96 0.95 0.96 0.94 0.94 0.96 0.96
Focal sA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sA ✓ ✓ ✓ ✓ ✓ ✓
Focal sE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sE ✓ ✓ ✓ ✓
Clinical History sE ✓ ✓ ✓ ✓ ✓ ✓
Note: This table summarizes the models used to estimate the Bayesian benchmark. Each column corresponds to a random forest classification tree with varying signal
structures. The rows Focal sA , Other sA , Focal sE , Other sE , and Clinical History sE indicate what features are included in the tree. Focal sA corresponds to the focal
pathology’s AI score, Other sA corresponds to vector of AI scores for all pathologies, Focal (Other) sE includes the radiologist’s report without AI assistance on the focal
pathology (all pathologies), and Clinical History sE contains summaries of the patient’s clinical history when available to the radiologist.
Here, we present the model selection results for the remaining pre-registered pathology
groups.
Table C.20: Model Selection: Top Level with AI
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
Automation Bias (b) 0.29 0.35 0.34 0.31 0.35 0.35 0.19 0.34 0.27 0.25 0.34 0.35
(0.02) (0.03) (0.03) (0.03) (0.03) (0.03) (0.02) (0.03) (0.02) (0.03) (0.03) (0.03)
Own Info Bias (d) 1.10 1.06 1.08 1.08 1.06 1.05 1.07 1.06 1.06 1.07 1.06 1.05
(0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01)
Constant 0.40 0.37 0.40 0.38 0.37 0.37 0.31 0.38 0.33 0.34 0.37 0.36
(0.03) (0.04) (0.04) (0.04) (0.04) (0.04) (0.03) (0.04) (0.03) (0.04) (0.04) (0.04)
Focal sA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sA ✓ ✓ ✓ ✓ ✓ ✓
Focal sE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sE ✓ ✓ ✓ ✓
Clinical History sE ✓ ✓ ✓ ✓ ✓ ✓
J-Statistic 18.23 0.03 16.86 10.45 0.14 0.17 4.3 0.02 10.7 6.13 0.06 0.09
MMSC-BIC -23.96 -8.41 -21.12 -10.64 -8.3 -8.27 -12.57 -8.42 -10.39 -10.75 -8.38 -8.35
Number of Moments 13 5 12 8 5 5 7 5 8 7 5 5
Observations 11420 11420 11420 11420 11420 11420 11420 11420 11420 11420 11420 11420
R-Squared 0.50 0.52 0.49 0.49 0.52 0.52 0.46 0.51 0.48 0.46 0.52 0.52
Note: This table presents results of the model selection exercise as described in Table 3, including the full set of models that were included in the selection procedure.
Table C.21: Model Selection: Pooled with AI
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
Automation Bias (b) 0.16 0.09 0.17 0.14 0.10 0.10 0.12 0.10 0.13 0.14 0.10 0.10
(0.01) (0.01) (0.02) (0.02) (0.01) (0.01) (0.01) (0.01) (0.02) (0.02) (0.01) (0.01)
Own Info Bias (d) 1.12 1.10 1.12 1.12 1.10 1.10 1.11 1.10 1.11 1.12 1.10 1.10
(0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01)
Constant 0.37 0.31 0.36 0.37 0.31 0.31 0.34 0.31 0.34 0.36 0.31 0.31
(0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04)
Focal sA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sA ✓ ✓ ✓ ✓ ✓ ✓
Focal sE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sE ✓ ✓ ✓ ✓
Clinical History sE ✓ ✓ ✓ ✓ ✓ ✓
J-Statistic 12.52 3.85 10.53 4.81 2.03 2.71 4.64 3.84 3.98 5.2 2.44 2.4
MMSC-BIC -12.79 -13.03 -10.57 -12.07 -14.85 -14.17 -12.24 -13.03 -12.9 -11.68 -14.43 -14.48
Number of Moments 9 7 8 7 7 7 7 7 7 7 7 7
Observations 57100 57100 57100 57100 57100 57100 57100 57100 57100 57100 57100 57100
R-Squared 0.45 0.42 0.44 0.43 0.42 0.42 0.43 0.42 0.43 0.43 0.42 0.42
Note: This table presents results of the model selection exercise as described in Table 3, though this table only includes all pathologies with AI assistance.
Table C.22: Model Selection: Abnormal
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
Automation Bias (b) 0.11 0.06 0.10 0.11 0.06 0.06 0.08 0.06 0.08 0.08 0.06 0.06
(0.03) (0.02) (0.03) (0.03) (0.02) (0.02) (0.02) (0.02) (0.03) (0.03) (0.02) (0.02)
Own Info Bias (d) 1.06 1.07 1.06 1.07 1.07 1.07 1.06 1.07 1.06 1.07 1.07 1.07
(0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02)
Constant 0.38 0.29 0.36 0.39 0.30 0.29 0.34 0.29 0.34 0.34 0.29 0.29
(0.07) (0.06) (0.07) (0.07) (0.06) (0.06) (0.06) (0.06) (0.07) (0.07) (0.06) (0.06)
Focal sA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sA ✓ ✓ ✓ ✓ ✓ ✓
Focal sE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sE ✓ ✓ ✓ ✓
Clinical History sE ✓ ✓ ✓ ✓ ✓ ✓
J-Statistic 12.65 15.83 13.55 12.74 15.0 15.43 15.8 15.95 16.19 14.87 15.78 15.41
MMSC-BIC -25.33 -22.14 -24.43 -25.24 -22.98 -22.54 -22.17 -22.03 -21.78 -23.1 -22.2 -22.56
Number of Moments 12 12 12 12 12 12 12 12 12 12 12 12
Observations 5710 5710 5710 5710 5710 5710 5710 5710 5710 5710 5710 5710
R-Squared 0.37 0.32 0.34 0.35 0.32 0.32 0.34 0.32 0.33 0.33 0.32 0.32
Note: This table presents results of the model selection exercise as described in Table 3, though this table only includes Abnormal.
This section presents the results of the model selection exercise not accounting for measure-
ment error in the human signal. In these analyses, the instruments are constructed using the
radiologist’s report on the case in the treatment arm without AI assistance. Note that some
time elapses between the reads, so the radiologist likely observes a different draw of $s^E$, introducing measurement error into the right-hand-side variables of equation 7. This is why the
preferred method uses instruments constructed using a leave-one-out average of reports for
the case. Table C.23 presents results for top-level pathologies with AI, Table C.24 presents
results for all pathologies with AI, and Table C.25 presents results for abnormal.
Table C.23: Model Selection: Top Level with AI without Accounting for Measurement Error
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
Automation Bias (b) 0.49 0.37 0.56 0.53 0.37 0.38 0.42 0.37 0.48 0.52 0.36 0.38
(0.02) (0.02) (0.03) (0.03) (0.02) (0.02) (0.02) (0.02) (0.03) (0.02) (0.02) (0.02)
Own Info Bias (d) 1.00 0.89 0.95 0.95 0.90 0.90 0.93 0.90 0.91 0.94 0.90 0.89
(0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02)
Constant 0.32 0.13 0.29 0.31 0.12 0.13 0.21 0.14 0.20 0.27 0.14 0.12
(0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04)
Focal sA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sA ✓ ✓ ✓ ✓ ✓ ✓
Focal sE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sE ✓ ✓ ✓ ✓
Clinical History sE ✓ ✓ ✓ ✓ ✓ ✓
J-Statistic 3.47 1.06 2.04 2.94 4.6 1.98 0.01 1.75 0.0 3.7 9.19 0.0
MMSC-BIC -4.97 -7.38 -2.18 -1.28 -3.84 -6.46 -4.21 -10.91 0.0 -0.52 -7.69 0.0
Number of Moments 5 5 4 4 5 5 4 6 3 4 7 3
Observations 11420 11420 11420 11420 11420 11420 11420 11420 11420 11420 11420 11420
R-Squared 0.58 0.55 0.56 0.56 0.55 0.55 0.56 0.55 0.56 0.56 0.55 0.55
Note: This table presents results of the model selection exercise as described in Table 3 for top-level pathologies with AI assistance without accounting for measurement
error in the radiologist reports.
Table C.24: Model Selection: Pooled with AI without Accounting for Measurement Error
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
Automation Bias (b) 0.45 0.35 0.45 0.46 0.36 0.35 0.39 0.35 0.41 0.45 0.36 0.35
(0.02) (0.01) (0.02) (0.02) (0.01) (0.01) (0.02) (0.01) (0.02) (0.02) (0.01) (0.01)
Own Info Bias (d) 1.01 0.93 0.97 0.98 0.93 0.94 0.96 0.93 0.95 0.97 0.94 0.93
(0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01)
Constant 0.21 0.00 0.10 0.14 -0.01 0.01 0.07 0.00 0.04 0.12 0.01 -0.01
(0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04) (0.04)
Focal sA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sA ✓ ✓ ✓ ✓ ✓ ✓
Focal sE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sE ✓ ✓ ✓ ✓
Clinical History sE ✓ ✓ ✓ ✓ ✓ ✓
J-Statistic 11.7 0.0 0.0 1.18 4.21 7.31 0.0 8.13 0.0 0.0 2.85 8.57
MMSC-BIC -13.62 0.0 0.0 -3.04 -12.67 -9.57 0.0 -8.75 0.0 0.0 -18.25 -8.31
Number of Moments 9 3 3 4 7 7 3 7 3 3 8 7
Observations 57100 57100 57100 57100 57100 57100 57100 57100 57100 57100 57100 57100
R-Squared 0.58 0.56 0.56 0.56 0.56 0.56 0.57 0.56 0.56 0.56 0.56 0.56
Note: This table presents results of the model selection exercise as described in Table 3 for all pathologies with AI assistance without accounting for measurement error in
the radiologist reports.
Table C.25: Model Selection: Abnormal without Accounting for Measurement Error
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)
Automation Bias (b) 0.47 0.35 0.44 0.42 0.37 0.37 0.36 0.36 0.36 0.41 0.37 0.36
(0.02) (0.02) (0.03) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02)
Own Info Bias (d) 0.92 0.89 0.86 0.90 0.88 0.88 0.85 0.88 0.82 0.87 0.87 0.88
(0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02)
Constant 1.23 0.98 1.16 1.12 1.02 1.05 0.99 1.04 1.01 1.11 1.05 1.00
(0.07) (0.06) (0.08) (0.07) (0.06) (0.06) (0.06) (0.06) (0.07) (0.07) (0.07) (0.06)
Focal sA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sA ✓ ✓ ✓ ✓ ✓ ✓
Focal sE ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Other sE ✓ ✓ ✓ ✓
Clinical History sE ✓ ✓ ✓ ✓ ✓ ✓
J-Statistic 12.09 7.1 0.35 4.71 3.06 4.94 0.0 5.79 0.0 0.34 0.0 4.85
MMSC-BIC -13.23 -5.56 -3.87 -3.73 -1.16 -3.5 0.0 -2.65 0.0 -3.88 0.0 -3.59
Number of Moments 9 6 4 5 4 5 3 5 3 4 3 5
Observations 5710 5710 5710 5710 5710 5710 5710 5710 5710 5710 5710 5710
R-Squared 0.56 0.54 0.53 0.54 0.54 0.54 0.54 0.54 0.53 0.53 0.54 0.54
Note: This table presents results of the model selection exercise as described in Table 3 for abnormal without accounting for measurement error in the radiologist reports.
[Figure: estimated indifference frontiers (Estimated Linear, Estimated Non-linear) plotted in the space of $\log \frac{\pi(s^A\mid\omega=1)}{\pi(s^A\mid\omega=0)}$ against $\log \frac{\pi(\omega=1\mid s^E)}{\pi(\omega=0\mid s^E)}$.]
Note: For the two top-level pathologies with AI assistance, we plot estimates of the indifference frontier where radiologists are
indifferent between following up on a patient-case and not following up on a patient-case for a Bayesian decision maker, the
linear model estimated in equation 7, a non-parametric version of equation 7, and the human only without AI assistance. In
addition, we plot the joint distribution of the signals' log-likelihoods.
Here we show the distribution of individual estimates of equation (7). Each estimate contains
sampling error, as each radiologist is only reading a subset of cases. Therefore, the unadjusted
distribution of individual-level estimates is overdispersed as it is a convolution of the true
individual-level parameters and the sampling noise. We adjust for this over-dispersion using
a Bayesian hierarchical model, where we model the individual parameter vector θr as follows.
$$\theta_r \sim N(\mu, \Sigma), \qquad \mu \sim N(0, 100\,I), \qquad \Sigma = \mathrm{diag}(\tau)\,\Omega\,\mathrm{diag}(\tau), \qquad \tau_k \sim \mathrm{Cauchy}(0, 2.5), \qquad \Omega \sim \mathrm{LKJCorr}(10)$$
We sample from the posterior of this model and plot the marginal distribution of the posterior
means of θr below.
[Figure: marginal distributions of the posterior means of individual b and d estimates (two panels).]
Note: Marginal distributions of individual b and d estimates by radiologist for top level pathologies with AI and all pathologies
with AI.
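The numpyro sketch below illustrates one way to implement this shrinkage adjustment; the measurement step, which treats each radiologist's point estimates as normally distributed around θr with their estimated standard errors, is an assumption added for illustration, and all inputs are simulated.

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def hierarchical_model(theta_hat, se):
    # theta_hat, se: (N, K) arrays of per-radiologist point estimates (e.g. of b and d)
    # and their standard errors -- hypothetical inputs for this sketch.
    n_rad, k = theta_hat.shape
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0).expand([k]).to_event(1))    # mu ~ N(0, 100 I)
    tau = numpyro.sample("tau", dist.HalfCauchy(2.5).expand([k]).to_event(1))    # tau_k ~ Cauchy+(0, 2.5)
    omega = numpyro.sample("omega", dist.LKJ(k, concentration=10.0))             # Omega ~ LKJCorr(10)
    sigma = jnp.outer(tau, tau) * omega                                          # Sigma = diag(tau) Omega diag(tau)
    with numpyro.plate("radiologist", n_rad):
        theta = numpyro.sample("theta", dist.MultivariateNormal(mu, covariance_matrix=sigma))
        # Assumed measurement equation: observed estimates are noisy versions of theta_r.
        numpyro.sample("obs", dist.Normal(theta, se).to_event(1), obs=theta_hat)

# Simulated inputs: 30 radiologists, two parameters each.
theta_hat = 0.5 + 0.2 * random.normal(random.PRNGKey(0), (30, 2))
se = jnp.full((30, 2), 0.1)

mcmc = MCMC(NUTS(hierarchical_model), num_warmup=500, num_samples=500)
mcmc.run(random.PRNGKey(1), theta_hat, se)
theta_post_mean = mcmc.get_samples()["theta"].mean(axis=0)   # posterior means of theta_r (shrunken estimates)
print(theta_post_mean.shape)
```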
In the experiment we elicit both probability assessments and treatment decisions, allowing us to identify the relative costs of false positives and false negatives that the radiologists are using. Recall that radiologist $r$ chooses to treat or follow up on pathology $p$ in patient case $i$ under treatment $t$ if $a_{ritp} = 1$, where $a_{ritp}$ is given by
$$a_{ritp} = \mathbb{1}\left[\frac{p_{ritp}}{1 - p_{ritp}} - c^{rp}_{rel} + \varepsilon_{ritp} > 0\right].$$
This leads to the estimating equation
$$\log \frac{P(a_{ritp} = 1)}{1 - P(a_{ritp} = 1)} = \beta_0 + \beta \log \frac{p_{ritp}}{1 - p_{ritp}} + \alpha_p + \gamma_r \qquad (9)$$
where $\alpha_p$ are pathology fixed effects and $\gamma_r$ are radiologist fixed effects. The relative costs of false positives to false negatives for radiologist $r$ and pathology $p$ can then be recovered as $c^{rp}_{rel} = \exp\!\left(-\frac{\beta_0 + \gamma_r + \alpha_p}{\beta}\right)$. For each pathology, we winsorize radiologists' relative costs at the 5th and 95th percentile. The results of this exercise are presented in Table C.26.
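A rough statsmodels sketch of this procedure on simulated data with hypothetical column names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated reads with hypothetical column names: reported probability, follow-up decision,
# pathology and radiologist identifiers.
rng = np.random.default_rng(4)
n = 3000
df = pd.DataFrame({
    "prob":        rng.uniform(0.01, 0.99, n),
    "pathology":   rng.integers(0, 4, n),
    "radiologist": rng.integers(0, 20, n),
})
df["followup"] = (rng.uniform(0, 1, n) < df["prob"]).astype(int)       # toy follow-up decisions
df["log_odds"] = np.log(df["prob"] / (1 - df["prob"]))

# Equation (9): logit of the follow-up decision on the log-odds of the reported probability,
# with pathology and radiologist fixed effects entering as dummies.
fit = smf.logit("followup ~ log_odds + C(pathology) + C(radiologist)", data=df).fit(disp=0)
beta = fit.params["log_odds"]

# Recover c_rel for each radiologist-pathology cell: the fitted linear index at log_odds = 0
# equals beta_0 + gamma_r + alpha_p, and c_rel = exp(-(beta_0 + gamma_r + alpha_p) / beta).
cells = df[["radiologist", "pathology"]].drop_duplicates().assign(log_odds=0.0)
p0 = fit.predict(cells)
cells["c_rel"] = np.exp(-np.log(p0 / (1 - p0)) / beta)

# Winsorize relative costs at the 5th and 95th percentiles within each pathology.
grp = cells.groupby("pathology")["c_rel"]
cells["c_rel"] = cells["c_rel"].clip(grp.transform(lambda s: s.quantile(0.05)),
                                     grp.transform(lambda s: s.quantile(0.95)))
```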