Assessing The Utility of ChatGPT Throughout The Entire Clinical Workflow
Arya Rao*a,b, BA, Michael Pang*a,b, BS, John Kima,b, BA, Meghana Kaminenia,b, BS,
Winston Liea,b, BA MSc, Anoop K. Prasada,b, MBBS, Adam Landmanc, MD, MS, MIS,
MHS, Keith J Dreyerd, PhD, DO, Marc D. Succia,b, MD
Corresponding Author:
Marc D. Succi, MD
Massachusetts General Hospital
Department of Radiology
55 Fruit Street
Boston, MA
02114
Phone: 617-935-9144
Email: [email protected]
@MarcSucciMD
ORCiD: 0000-0002-1518-3984
a Harvard Medical School, Boston, MA
b Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA
c Brigham and Women’s Hospital, Boston, MA
d MGH and BWH Data Science Center, Massachusetts General Hospital, Boston, MA
Data Sharing Statement: All data generated or analyzed during the study are included
in the published paper.
Abstract
IMPORTANCE: Large language model (LLM) artificial intelligence (AI) chatbots direct
the power of large training datasets towards successive, related tasks, as opposed to
single-ask tasks, for which AI already achieves impressive performance. The capacity of
LLMs to assist in the full scope of iterative clinical reasoning via successive prompting,
in effect acting as virtual physicians, has not yet been evaluated.
OBJECTIVE: To evaluate ChatGPT’s capacity for ongoing clinical decision support via
its performance on standardized clinical vignettes.
DESIGN: We inputted all 36 published clinical vignettes from the Merck Sharp &
Dohme (MSD) Clinical Manual into ChatGPT and compared accuracy on differential
diagnoses, diagnostic testing, final diagnosis, and management based on patient age,
gender, and case acuity.
SETTING: ChatGPT, a publicly available LLM
PARTICIPANTS: Clinical vignettes featured hypothetical patients with a variety of ages
and gender identities, and a range of Emergency Severity Indices (ESIs) based on initial
clinical presentation.
EXPOSURES: MSD Clinical Manual vignettes
MAIN OUTCOMES AND MEASURES: We measured the proportion of correct
responses to the questions posed within the clinical vignettes tested.
RESULTS: ChatGPT achieved 71.7% (95% CI, 69.3% to 74.1%) accuracy overall
across all 36 clinical vignettes. The LLM demonstrated the highest performance in
making a final diagnosis with an accuracy of 76.9% (95% CI, 67.8% to 86.1%), and the
lowest performance in generating an initial differential diagnosis with an accuracy of
60.3% (95% CI, 54.2% to 66.6%). Compared to answering questions about general
medical knowledge, ChatGPT demonstrated inferior performance on differential
diagnosis (β=-15.8%, p<0.001) and clinical management (β=-7.4%, p=0.02) type
questions.
CONCLUSIONS AND RELEVANCE: ChatGPT achieves impressive accuracy in clinical
decision making, with particular strengths emerging as it has more clinical information at
its disposal.
Introduction
Despite its relative infancy, artificial intelligence (AI) is transforming healthcare, with
current uses including workflow triage, predictive models of utilization, labeling and
interpretation of radiographic images, patient support via interactive chatbots,
communication aids for non-English-speaking patients, and more.1–8 Yet, all of these use
cases are limited to a specific part of the clinical workflow and do not provide
longitudinal patient or clinician support. An under-explored use of AI in medicine is
predicting and synthesizing patient diagnoses, treatment plans, and outcomes. Until
recently, AI models have lacked sufficient accuracy and power to engage meaningfully
in the clinical decision-making space. However, the advent of large language models
(LLMs), which are trained on large amounts of human-generated text like the Internet,
has motivated further investigation into whether AI can serve as an adjunct in clinical
decision making throughout the entire clinical workflow, from triage to diagnosis to
management. In this study, we assess the performance of a novel LLM, ChatGPT, on
comprehensive clinical vignettes (short, hypothetical patient cases used to test clinical
knowledge and reasoning).
ChatGPT, a popular LLM chatbot released by OpenAI in November 2022,9 has shown promise on a range of standardized examinations, including the USMLE, the bar exam, the CPA exam, and a Wharton MBA final exam,10–14 and in generating scientific texts as found in biomedical literature.15 Recently, there has been great interest in utilizing the nascent but powerful chatbot for clinical decision support.16–18
Given that LLMs like ChatGPT have the ability to integrate large amounts of textual
information to synthesize responses to human-generated prompts, we speculated that
ChatGPT would be able to act as an on-the-ground copilot in clinical reasoning, making
use of the wealth of information available during patient care from the Electronic Health
Record (EHR) and other sources. We focused on comprehensive clinical vignettes as a
model, and tested the hypothesis that when provided clinical vignettes, ChatGPT would
be able to recommend diagnostic workup, decide the clinical management course, and
ultimately make the diagnosis, working through the entire clinical encounter.
Our study is the first to make use of ChatGPT’s ability to integrate information from the
earlier portions of a conversation into downstream responses. Thus, this model lends
itself well to the iterative nature of clinical medicine, in that the influx of new information
requires constant updating of prior hypotheses.
Methods
Study Design
We assessed ChatGPT’s accuracy in solving comprehensive clinical vignettes,
comparing across patient age, gender, and acuity of clinical presentation. We presented
each portion of the clinical workflow as a successive prompt to the model (differential
diagnosis, diagnostic testing, final diagnosis, and clinical management questions were
presented one after the other) (Figure 1A).
Setting
ChatGPT (OpenAI, San Francisco, CA) is a transformer-based language model with the
ability to generate human-like text. It captures the context and relationship between
words in input sequences through multiple layers of self-attention and feed-forward
neural networks. The language model is trained on a variety of text including websites,
articles, and books up until 2021. The ChatGPT model is self-contained in that it does
not have the ability to search the internet when generating responses. Instead, it
predicts the most likely “token” to succeed the previous one based on patterns in its
training data. Therefore, it does not explicitly search through existing information, nor
does it copy existing information. All ChatGPT model output was collected from the
January 9, 2023 version of ChatGPT.
Case transcripts were generated by copying MSD manual vignettes directly into
ChatGPT. Questions posed in the MSD manual vignettes were presented as successive
inputs to ChatGPT (Figure 1B). All questions requesting the clinician to analyze images
were excluded from our study, as ChatGPT is a text-based AI without the ability to
interpret visual information.
ChatGPT’s answers are informed by the context of the ongoing conversation. To avoid
the influence of other vignettes’ answers on model output, a new ChatGPT session was
instantiated for each vignette. A single session was maintained for each vignette and for
all associated questions, allowing ChatGPT to take all available vignette information into
account as it proceeded to answer new questions. To account for response-by-response
variation, each vignette was tested in triplicate, each time by a different user. Prompts
were not modified from user to user.
We awarded points for each correct answer given by ChatGPT and noted the total
number of correct decisions possible for each question. For example, for a question
asking whether each of a list of diagnostic tests is appropriate for the patient presented,
a point was awarded for each time ChatGPT’s answer was concordant with the
provided Merck answer.
Two scorers independently calculated an individual score for each output to ensure
consensus on all output scores; there were no scoring discrepancies. The final score for
each prompt was calculated as an average of the three replicate scores. Based on the
total possible number of correct decisions per question, we calculated a proportion of
correct decisions for each question (“average proportion correct” refers to the average
proportion across replicates). A schematic of the workflow is provided in Figure 1A.
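As an illustration of the scoring arithmetic described above (points awarded per correct decision, proportions then averaged across the three replicates), the following R sketch uses hypothetical values and column names; the actual scoring was performed manually by two independent scorers.

# Hedged sketch of the scoring arithmetic: 'scores' holds one row per question
# per replicate, with the points awarded and the total decisions possible.
scores <- data.frame(
  vignette  = c(1, 1, 1),
  question  = c("diag_1", "diag_1", "diag_1"),
  replicate = c(1, 2, 3),
  correct   = c(7, 6, 7),   # hypothetical points awarded to ChatGPT's answers
  possible  = c(8, 8, 8)    # total correct decisions possible for the question
)

# Proportion of correct decisions for each replicate of each question
scores$prop_correct <- scores$correct / scores$possible

# "Average proportion correct": mean of the three replicate proportions
avg_prop <- aggregate(prop_correct ~ vignette + question, data = scores, FUN = mean)
print(avg_prop)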
We assessed the acuity of clinical presentation for each vignette using the Emergency Severity Index (ESI). The ESI is a five-level triage algorithm used to assign patient priority in the emergency department; assignment is based on medical urgency and takes into account the patient's chief complaint, vital signs, and ability to ambulate. The ESI is an ordinal scale ranging from 1 to 5, corresponding to highest to lowest acuity, respectively. For each vignette, we fed the history of present illness (HPI) into ChatGPT to determine its ESI and cross-validated the result against human ESI scoring. All vignette metadata, including title, age, gender, ESI, and final diagnosis, can be found in eTable 1.
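The manuscript does not detail how agreement between ChatGPT-assigned and human-assigned ESI levels was summarized; the R sketch below shows one simple way this could be tabulated, using made-up values and hypothetical column names.

# Hedged sketch: comparing ChatGPT-assigned vs. human-assigned ESI levels
# (1 = highest acuity, 5 = lowest). Values and column names are illustrative.
esi <- data.frame(
  vignette    = 1:6,
  esi_chatgpt = c(2, 3, 3, 1, 4, 2),
  esi_human   = c(2, 3, 2, 1, 4, 2)
)

# Exact agreement rate between the two raters
mean(esi$esi_chatgpt == esi$esi_human)

# Cross-tabulation of assigned acuity levels
table(ChatGPT = esi$esi_chatgpt, Human = esi$esi_human)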
Questions posed by the MSD Manual vignettes fall into several categories: differential diagnosis questions (abbreviated as diff), which ask the user to determine which of several conditions cannot be eliminated from an initial differential; diagnostic questions (abbreviated as diag), which ask the user to determine appropriate diagnostic steps based on the current hypotheses and information; diagnosis questions (abbreviated as dx), which ask the user for a final diagnosis; management questions (abbreviated as mang), which ask the user to recommend appropriate clinical interventions; and miscellaneous questions (abbreviated as misc), which ask the user medical knowledge questions relevant to the vignette but not necessarily specific to the patient at hand. We stratified results by question type and by the demographic information previously described.
Statistical Methods
Multivariable linear regression was performed using the lm() function in R version
4.2.1 (R Core Team, Vienna, Austria) to assess the relationship between ChatGPT vignette
performance, question type, demographic variables (age, gender), and clinical acuity
(ESI). Question type was dummy-variable-encoded to assess the effect of each
category independently. The misc question type was chosen as the reference variable
as these questions assess general knowledge and not necessarily active clinical
reasoning. Age, gender, and ESI were also included in the model to control for potential
sources of confounding.
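The R sketch below illustrates the regression specification described above; the data frame, its column names, and its values are hypothetical stand-ins for the per-question results table. R's lm() applies dummy-variable (treatment-contrast) encoding to factor predictors automatically, so setting misc as the reference level reproduces the comparison described in the text.

# Hedged sketch of the multivariable linear regression; all data are synthetic.
set.seed(1)
df <- data.frame(
  prop_correct  = runif(40, 0.4, 1.0),
  question_type = factor(sample(c("misc", "diff", "diag", "dx", "mang"), 40, replace = TRUE)),
  age           = sample(18:80, 40, replace = TRUE),
  gender        = factor(sample(c("female", "male"), 40, replace = TRUE)),
  esi           = sample(1:5, 40, replace = TRUE)
)

# misc as the reference category for the dummy-encoded question type
df$question_type <- relevel(df$question_type, ref = "misc")

fit <- lm(prop_correct ~ question_type + age + gender + esi, data = df)
summary(fit)  # each question_type coefficient is the difference relative to misc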
Results
Overall Performance
Since questions from all vignettes fall into several distinct categories, we were able to
assess performance not only on a vignette-by-vignette basis, but also on a category-by-
category basis. We found that on average, across all vignettes, ChatGPT achieved
71.8% accuracy (Figure 2A, eTable 2, eTable 3). Between categories and across all
vignettes, ChatGPT achieved the highest accuracy (76.9%) for questions in the dx
category, and the lowest accuracy for questions in the diff category (60.3%) (Figure 2B,
eTable 3). Trends for between-question-type variation in accuracy for each vignette are
shown in Figure 2C.
Vignette #28, featuring a right testicular mass in a 28-year-old man (final diagnosis of
testicular cancer), showed the highest accuracy overall (83.8%). Vignette #27, featuring
recurrent headaches in a 31-year-old woman (final diagnosis of pheochromocytoma),
showed the lowest accuracy overall (55.9%) (Figure 2A, eTable 2).
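The manuscript does not state how the 95% confidence intervals for these proportions were computed; as one illustrative possibility, a normal-approximation (Wald) interval could be obtained as in the hedged R sketch below, which uses made-up counts rather than the study's actual data.

# Hedged sketch: 95% CI for a proportion of correct decisions (illustrative counts).
n_correct <- 718    # hypothetical number of correct decisions
n_total   <- 1000   # hypothetical total decisions scored
p_hat <- n_correct / n_total
se    <- sqrt(p_hat * (1 - p_hat) / n_total)      # Wald standard error
ci    <- p_hat + c(-1, 1) * qnorm(0.975) * se     # normal-approximation interval
round(c(estimate = p_hat, lower = ci[1], upper = ci[2]), 3)

# Base R's prop.test() gives a similar (score-based, continuity-corrected) interval
prop.test(n_correct, n_total)$conf.int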
Discussion
In this study, we present first-of-its-kind evidence assessing the potential use of novel
artificial intelligence tools throughout the entire clinical workflow, encompassing initial
diagnostic workup, diagnosis, and clinical management. We provide the first analysis of
ChatGPT’s iterative prompt functionality in the clinical setting, reflecting the constantly
shifting nature of patient care by allowing upstream prompts and responses to affect
downstream answers. We show that ChatGPT achieves 60.3% accuracy in determining
differential diagnoses based on the HPI, physical exam (PE), and review of systems (ROS) alone. With additional
information such as the results of relevant diagnostic testing, ChatGPT achieves 76.9%
accuracy in narrowing a final diagnosis.
ChatGPT achieves an average performance of 71.8% across all vignettes and question
types. Notably, of the patient-focused questions posed by each vignette, ChatGPT
achieved the highest accuracy (76.9% on average) answering dx questions, which
prompted the model to provide a final diagnosis based on HPI, PE, ROS, diagnostic
results, and any other pertinent clinical information. There was no statistical difference
between dx accuracy and misc accuracy, indicating that ChatGPT performance on a
specific clinical case, when provided with all possible relevant clinical information,
approximates its accuracy in providing general medical facts.
Overall accuracy was lower for diag and mang questions than for diff and dx questions
(Figure 2B). In some cases, this was because ChatGPT recommended extra or
unnecessary diagnostic testing or clinical intervention, respectively (eTable 4). In
contrast, for several diff and dx questions (for which all necessary information was
provided to answer, as for the diag and mang questions), ChatGPT refused to provide a
diagnosis altogether (eTable 4). This indicates that ChatGPT is not always able to properly navigate clinical scenarios with a well-established standard of care (e.g., a clear diagnosis based on a canonical presentation) or situations in which the course of action is more ambiguous (e.g., ruling out unnecessary testing). The latter observation is in line with Rao et al.'s observation that ChatGPT struggles to identify situations in which diagnostic testing is futile.17 Resource utilization was not explicitly tested in our study.
Rao et al. found that for breast cancer and breast pain screening, ChatGPT’s accuracy
in determining appropriate radiologic diagnostic workup varied with the severity of initial
presentation. For breast cancer, there was a positive correlation between severity and
accuracy, and for breast pain there was a negative correlation.17 Given that the data in
this study cover 36 different clinical scenarios as opposed to trends within specific
clinical conditions, we suspect that any association between acuity of presentation and
accuracy could be found on a within-case basis, as opposed to between cases.
Given the important ongoing discourse surrounding bias in the clinical setting and bias in AI-based clinical tools,3–8 assessing ChatGPT's performance with respect to the age and gender of patients represents an important touchpoint in both discussions.21–25
While we did not find that age or gender is a significant predictor of accuracy, we note
that our vignettes represent classic presentations of disease, and that atypical
presentations may generate different biases. Further investigation into additional
demographic variables and possible sources of systematic bias is warranted in future
studies.
While on the surface ChatGPT performs impressively, it is worth noting that even small
errors in clinical judgment can result in adverse outcomes. ChatGPT’s answers are
generated based on finding the next most likely “token” or word/phrase to complete the
ongoing answer; as such, ChatGPT lacks reasoning capacity. This is evidenced by
instances in which ChatGPT recommends futile care or refuses to provide a diagnosis
even when equipped with all necessary information and is further evidenced by its
frequent errors in dosing. These limitations are inherent to the artificial intelligence
model itself and can be broadly divided into several categories, including misalignment
and hallucination.26,27 In this study, we identified and accounted for these limitations with
replicate validation. These considerations are necessary when determining both the
parameters of artificial intelligence utilization in the clinical workflow and the regulations
surrounding the approval of similar technologies in clinical settings.
References
1. Yu KH, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng.
2018;2(10):719-731. doi:10.1038/s41551-018-0305-z
2. Xu L, Sanders L, Li K, Chow JCL. Chatbot for Health Care and Oncology Applications
Using Artificial Intelligence and Machine Learning: Systematic Review. JMIR Cancer.
2021;7(4):e27850. doi:10.2196/27850
3. Chonde DB, Pourvaziri A, Williams J, et al. RadTranslate: An Artificial Intelligence–
Powered Intervention for Urgent Imaging to Enhance Care Equity for Patients With
Limited English Proficiency During the COVID-19 Pandemic. J Am Coll Radiol.
2021;18(7):1000-1008. doi:10.1016/j.jacr.2021.01.013
4. Chung J, Kim D, Choi J, et al. Prediction of oxygen requirement in patients with COVID-
19 using a pre-trained chest radiograph xAI model: efficient development of auditable
risk prediction models via a fine-tuning approach. Sci Rep. 2022;12(1):21164.
doi:10.1038/s41598-022-24721-5
5. Li MD, Arun NT, Aggarwal M, et al. Multi-population generalizability of a deep learning-
based chest radiograph severity score for COVID-19. Medicine (Baltimore).
2022;101(29):e29587. doi:10.1097/MD.0000000000029587
6. Kim D, Chung J, Choi J, et al. Accurate auto-labeling of chest X-ray images based on
quantitative similarity to an explainable AI model. Nat Commun. 2022;13(1):1867.
doi:10.1038/s41467-022-29437-8
7. O’Shea A, Li MD, Mercaldo ND, et al. Intubation and mortality prediction in hospitalized
COVID-19 patients using a combination of convolutional neural network-based scoring
of chest radiographs and clinical data. BJR|Open. 2022;4(1):20210062.
doi:10.1259/bjro.20210062
8. Witowski J, Choi J, Jeon S, et al. MarkIt: A Collaborative Artificial Intelligence
Annotation Platform Leveraging Blockchain For Medical Imaging Research. Blockchain
Healthc Today. Published online May 5, 2021. doi:10.30953/bhty.v4.176
9. ChatGPT: Optimizing Language Models for Dialogue. OpenAI. Published November 30,
2022. Accessed February 15, 2023. https://openai.com/blog/chatgpt/
10. Kung TH, Cheatham M, ChatGPT, et al. Performance of ChatGPT on USMLE:
Potential for AI-Assisted Medical Education Using Large Language Models. Published
online December 21, 2022:2022.12.19.22283643. doi:10.1101/2022.12.19.22283643
11. Bommarito II M, Katz DM. GPT Takes the Bar Exam. Published online December
29, 2022. doi:10.48550/arXiv.2212.14402
12. Choi JH, Hickman KE, Monahan A, Schwarcz D. ChatGPT Goes to Law School.
Published online January 23, 2023. doi:10.2139/ssrn.4335905
13. Bommarito J, Bommarito M, Katz DM, Katz J. GPT as Knowledge Worker: A
Zero-Shot Evaluation of (AI)CPA Capabilities. Published online January 11, 2023.
doi:10.48550/arXiv.2301.04408
14. Terwiesch C. Would Chat GPT Get a Wharton MBA?
15. Flanagin A, Bibbins-Domingo K, Berkwits M, Christiansen SL. Nonhuman
“Authors” and Implications for the Integrity of Scientific Publication and Medical
Knowledge. JAMA. Published online January 31, 2023. doi:10.1001/jama.2023.1344
16. Bates DW, Levine D, Syrowatka A, et al. The potential of artificial intelligence to
improve patient safety: a scoping review. Npj Digit Med. 2021;4(1):1-8.
doi:10.1038/s41746-021-00423-6
17. Rao A, Kim J, Kamineni M, Pang M, Lie W, Succi MD. Evaluating ChatGPT as
an Adjunct for Radiologic Decision-Making. Published online February 7,
2023:2023.02.02.23285399. doi:10.1101/2023.02.02.23285399
18. Levine DM, Tuwani R, Kompa B, et al. The Diagnostic and Triage Accuracy of
the GPT-3 Artificial Intelligence Model. Published online February 1,
2023:2023.01.30.23285067. doi:10.1101/2023.01.30.23285067
19. Case studies. Merck Manuals Professional Edition. Accessed February 1, 2023.
https://www.merckmanuals.com/professional/pages-with-widgets/case-studies?mode=list
20. Eitel DR, Rudkin SE, Malvehy MA, Killeen JP, Pines JM. Improving Service
Quality by Understanding Emergency Department Flow: A White Paper and Position
Statement Prepared For the American Academy of Emergency Medicine. J Emerg Med.
2010;38(1):70-79. doi:10.1016/j.jemermed.2008.03.038
21. Byrne MD. Reducing Bias in Healthcare Artificial Intelligence. J Perianesth Nurs.
2021;36(3):313-316. doi:10.1016/j.jopan.2021.03.009
22. Panch T, Mattie H, Atun R. Artificial intelligence and algorithmic bias: implications
for health systems. J Glob Health. 9(2):020318. doi:10.7189/jogh.09.020318
23. Institute of Medicine (US) Committee on Understanding and Eliminating Racial
and Ethnic Disparities in Health Care. Unequal Treatment: Confronting Racial and
Ethnic Disparities in Health Care. (Smedley BD, Stith AY, Nelson AR, eds.). National
Academies Press (US); 2003. Accessed February 13, 2023.
http://www.ncbi.nlm.nih.gov/books/NBK220358/
24. Zhang H, Lu AX, Abdalla M, McDermott M, Ghassemi M. Hurtful words:
quantifying biases in clinical contextual word embeddings. In: Proceedings of the ACM
Conference on Health, Inference, and Learning. CHIL ’20. Association for Computing
Machinery; 2020:110-120. doi:10.1145/3368555.3384448
25. Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the Dangers of
Stochastic Parrots: Can Language Models Be Too Big? 🦜 In: Proceedings of
the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT ’21.
Association for Computing Machinery; 2021:610-623. doi:10.1145/3442188.3445922
26. Ji Z, Lee N, Frieske R, et al. Survey of Hallucination in Natural Language
Generation. ACM Comput Surv. Published online November 17, 2022:3571730.
doi:10.1145/3571730
27. Perez F, Ribeiro I. Ignore Previous Prompt: Attack Techniques For Language
Models. Published online November 17, 2022. doi:10.48550/arXiv.2211.09527