
npj | digital medicine Comment

Published in partnership with Seoul National University Bundang Hospital

https://doi.org/10.1038/s41746-024-01083-y

Evaluating large language models as agents in the clinic

Nikita Mehandru, Brenda Y. Miao, Eduardo Rodriguez Almaraz, Madhumita Sushil, Atul J. Butte & Ahmed Alaa

Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent "agents" that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model's ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as "Artificial Intelligence Structured Clinical Examinations" ("AI-SCE"), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial to deploying LLM agents in medical settings.

The release of ChatGPT, a chatbot powered by a large language model (LLM), has brought LLMs into the spotlight and unlocked opportunities for their use in healthcare settings. Med-PaLM 2, Google's medical LLM, was found to consistently perform at a human expert level on medical examination questions, scoring 85%1. While this model, part of Google's family of foundation models known as MedLM, is fine-tuned for the healthcare industry, even large LLMs trained on openly available information from the Internet, not just biomedical information, have immense potential to improve and augment clinical workflows2–4. For instance, the Generative Pre-trained Transformer-4 (GPT-4) model can generate summaries of physician–patient encounters from transcripts of conversations5, achieve a score of 86% on the United States Medical Licensing Examination (USMLE)6, and create clinical question–answer pairs that are largely indistinguishable from human-generated USMLE questions7. These early demonstrations of GPT-4 and other LLMs on clinical tasks and benchmarks suggest that these models have the potential to improve and automate aspects of clinical work.

However, the emergent capabilities of LLMs have significantly expanded their potential beyond conventional, standardized clinical natural language processing (NLP) tasks, which primarily revolve around text processing and question answering. Instead, there is a growing emphasis on utilizing LLMs for more complex physician- and patient-facing tasks that may involve multi-step information synthesis, use of external data sources, high-level reasoning, or even simulation of clinical text and conversations8,9. In these scenarios, LLMs should not be viewed as models of language, but rather as intelligent "agents" with internal planning capabilities that allow them to perform complex, multi-step reasoning or interact with tools, databases, other agents, or external users to better respond to user requests9,10. Here, we discuss how LLM agents can be used in clinical settings, and the challenges to the development and evaluation of these approaches.

Development of LLM agents for clinical use
LLM agents can be developed for a variety of clinical use cases by giving the LLM access to different sources of information and tools, including clinical guidelines, databases containing electronic health records, clinical calculators, or other curated clinical software tools9,10. These agents can respond to user requests by autonomously identifying and retrieving relevant information, or by performing multi-step analyses to answer questions, model data, or produce visualizations. Different agents can even interact and collaborate with each other in "multi-agent" settings to identify or check proposed solutions to difficult problems, or to model medical conversations and decision-making processes11.

Healthcare systems are already adopting LLMs capable of powering clinical agents; for instance, UC San Diego Health is working to integrate GPT-4 into MyChart, Epic's online health portal, to streamline patient messaging12. Patients also leverage publicly available chatbots (such as ChatGPT) to better understand medical vocabulary from clinical notes, and some medical centers are exploring a "virtual-first" approach in which LLMs assist in patient triaging13,14. When connected to additional sources of information and tools, the versatility and adaptability of clinical agents make them well suited to supporting both routine administrative tasks and clinical decision support.

Clinical simulations using agent-based modeling (ABM)
To evaluate the utility and safety of LLM-based chatbots as agents in these applications, we suggest the use of benchmarks that are not confined to traditional, narrowly scoped assessments based on NLP benchmarks, which consist of predetermined inputs and ground truths. Instead, approaches from agent-based modeling (ABM)15 can be used to create a simulated environment for effective evaluation of LLM agents. ABM is a computational framework that simulates the actions and interactions of autonomous agents to provide insights into system-level behavior and outcomes. This approach has been used in health policy, biology, and the social sciences to conduct studies that simulate health behaviors and the spread of infectious diseases16,17.
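The ABM evaluation loop described above (simulated patients interacting with an agent, with outcomes scored against known ground truth) can be sketched in a few lines of Python. This is a toy illustration, not a clinical system: the keyword rule inside TriageAgent.step stands in for a real LLM call, and the symptom scripts and acuity labels are invented for the example.

```python
# Minimal agent-based simulation sketch: scripted patient agents interact with
# a stand-in triage agent, and outcomes are scored against known ground truth.

class SimulatedPatient:
    """Replays a scripted presentation, one utterance per turn."""
    def __init__(self, symptoms, true_acuity):
        self.symptoms = list(symptoms)
        self.true_acuity = true_acuity  # ground truth, used only for scoring

    def respond(self):
        return self.symptoms.pop(0) if self.symptoms else None


class TriageAgent:
    """Placeholder for an LLM agent: a keyword rule keeps the sketch
    self-contained and deterministic; a real system would call a model here."""
    URGENT = {"chest pain", "shortness of breath"}

    def __init__(self):
        self.transcript = []  # intermediate steps, retained for evaluation

    def step(self, utterance):
        self.transcript.append(utterance)
        return "urgent" if utterance in self.URGENT else None  # None = keep listening


def run_encounter(patient, agent, max_turns=10):
    """Run one simulated encounter and return the agent's final disposition."""
    for _ in range(max_turns):
        utterance = patient.respond()
        if utterance is None:  # patient has nothing more to report
            return "routine"
        decision = agent.step(utterance)
        if decision is not None:
            return decision
    return "routine"


# Score the agent over a small population of invented scenarios.
scenarios = [
    (["headache", "chest pain"], "urgent"),
    (["sore throat", "cough"], "routine"),
    (["shortness of breath"], "urgent"),
]
correct = sum(
    run_encounter(SimulatedPatient(s, acuity), TriageAgent()) == acuity
    for s, acuity in scenarios
)
print(f"triage agreement: {correct}/{len(scenarios)}")  # prints "triage agreement: 3/3"
```

Because each agent retains its transcript, an evaluator can inspect the intermediate steps (which utterances the agent saw before deciding) rather than only the final disposition, which is the kind of process-level assessment that simulation-based evaluation enables.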

npj Digital Medicine | (2024)7:84



ABM has also been used to evaluate autonomous agents in the domain of self-driving cars18. In this field, simulations of real-world environments containing road obstacles, traffic signals, other cars, and pedestrians can be used to evaluate and refine the behaviors of autonomous vehicle agents as they encounter these different elements19. Similarly, by simulating the clinical settings where LLM agents may be deployed, including patient–physician interactions and hospital processes, we can use an ABM approach to evaluate how an LLM agent may interact with users, which tools or data an LLM employs to carry out user requests, and points of failure that lead to erroneous outputs or downstream errors.

Interestingly, patients and physicians can themselves be simulated as LLM agents in ABM environments. Previous research has demonstrated the feasibility of employing LLMs to create "interactive simulacra" that replicate human behavior9–11. To develop these high-fidelity simulations, data on physician and patient behavior can be derived from real-world electronic health records or clinical trial data, ideally with validation from multiple hospital systems and encompassing diverse patient populations. De-identified datasets (e.g., MIMIC-IV, UCSF Information Commons) or federated learning approaches can be used to help protect patient privacy20,21.

Evaluating agent-based simulations using an AI-SCE framework
Similar to standards and regulations for the autonomous driving industry, identifying robust clinical guidelines and what constitutes a successful interaction for healthcare LLM agents will be crucial to fulfilling the long-term goals of patients, providers, and other clinical stakeholders. In medical education, there has been a shift from assessing students with standardized tests, which evaluate shallow clinical reasoning, to modern curricula that increasingly use the Objective Structured Clinical Examination (OSCE)22. These exams assess a student's practical skills in the clinic, including the ability to examine patients, take clinical histories, communicate effectively, and handle unexpected situations. Google recently developed Articulate Medical Intelligence Explorer (AMIE), a research AI system for diagnostic medical reasoning and conversations, which was evaluated against the performance of primary care physicians (PCPs) in the style of an OSCE23.

Current benchmarks for clinical NLP, including MedQA (USMLE-style questions) and MedNLI (which tests whether two clinical statements logically follow from each other), are often derived from standardized tests or curated clinical text. This information, however, is not a sufficient metric because it fails to capture the full range of capabilities demonstrated by clinical LLM agents24,25. As a result, we call for the development of Artificial Intelligence Structured Clinical Examinations (AI-SCEs) that can be used to assess the ability of LLMs to aid in real-world clinical workflows. These AI-SCE benchmarks, which may be derived from difficult clinical scenarios or from real-world clinical tasks, should be created with input from interdisciplinary teams of clinicians, computer scientists, and medical researchers. OSCEs typically consist of long lists of processes or diagnoses on which students are graded. Similarly, AI-SCE benchmarks would extend beyond traditional computer science metrics, such as BLEU or ROUGE scores, that often do not account for semantic meaning, and would draw from preexisting multi-turn benchmarks26.

The AI-SCE format should be used to evaluate both the outputs of high-fidelity agent simulations and the intermediate steps that capture the agent's reasoning process, tool usage, data curation, or interactions with other agents or external users. Thus, a valuable contribution of these agents is their ability to provide interpretability throughout the decision-making process, as opposed to only at the final step27. These evaluations can also capture how the systematic addition or removal of LLM agents affects overall outcomes, and should be used to inform guardrails for clinical LLMs, which have been developed for general-purpose models to constrain their behavior28.

One added complexity of assessing agents using an AI-SCE format is the complicated nature of many clinical tasks, for which there may not be perfect concordance among individual human evaluators. We emphasize the continued need for a panel of human evaluators, and the importance of testing agent outcomes on external datasets. We also recognize the importance of post-deployment monitoring to ensure data distribution shifts do not occur over time, and to mitigate bias in model performance25. Furthermore, randomized controlled trials (RCTs) should be conducted to compare how well these simulation environments capture real-world settings, as well as the real-world impact of LLM agents in augmenting clinical workflows.

As LLMs evolve and demonstrate increasingly advanced capabilities, their involvement in clinical practice will extend beyond limited text processing tasks29. In the near future, it may become necessary to shift our benchmarks from static datasets to dynamic simulation environments and to transition from language modeling to agent modeling. Drawing inspiration from fields such as biology and economics could be beneficial for future LLM research and development for clinical applications.

Nikita Mehandru1,6, Brenda Y. Miao2,6, Eduardo Rodriguez Almaraz3,4,6, Madhumita Sushil2, Atul J. Butte2,5 & Ahmed Alaa1,2

1University of California, Berkeley, 2195 Hearst Ave, Warren Hall Suite 120C, Berkeley, CA, USA. 2Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA. 3Neurosurgery Department, Division of Neuro-Oncology, University of California San Francisco, 400 Parnassus Avenue, 8th Floor, RM A808, San Francisco, CA, USA. 4Department of Epidemiology and Biostatistics, University of California San Francisco, 400 Parnassus Avenue, 8th Floor, RM A808, San Francisco, CA, USA. 5Department of Pediatrics, University of California San Francisco, San Francisco, CA, USA. 6These authors contributed equally: Nikita Mehandru, Brenda Y. Miao, Eduardo Rodriguez Almaraz. e-mail: [email protected]

Received: 25 August 2023; Accepted: 22 March 2024

References
1. Singhal, K. et al. Towards expert-level medical question answering with large language models. Preprint at https://arxiv.org/abs/2305.09617 (2023).
2. Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1998–2022 (ACL, 2022).
3. Brown, T. B. et al. Language models are few-shot learners. In Proc. NeurIPS 2020 (2020).
4. Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).
5. Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 388, 1233–1239 (2023).
6. Fleming, S. L. et al. Assessing the potential of USMLE-like exam questions generated by GPT-4. Preprint at https://doi.org/10.1101/2023.04.25.23288588 (2023).
7. Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at https://doi.org/10.48550/arXiv.2303.13375 (2023).
8. Dash, D. et al. Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery. Preprint at https://doi.org/10.48550/arXiv.2304.13714 (2023).
9. Park, J. S. et al. Generative agents: interactive simulacra of human behavior. In Proc. 36th Symposium on User Interface Software and Technology (UIST) 1–22 (ACM, 2023).
10. Yang, H., Yue, S. & He, Y. Auto-GPT for online decision making: benchmarks and additional opinions. Preprint at https://doi.org/10.48550/arXiv.2306.02224 (2023).


11. Johri, S. et al. Testing the limits of language models: a conversational framework for medical AI assessment. Preprint at https://www.medrxiv.org/content/10.1101/2023.09.12.23295399v2 (2023).
12. Introducing Dr. Chatbot. https://today.ucsd.edu/story/introducing-dr-chatbot (2023).
13. Levine, D. M. et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model. Preprint at https://doi.org/10.1101/2023.01.30.23285067 (2023).
14. Korngiebel, D. M. & Mooney, S. D. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. npj Digit. Med. 4, 1–3 (2021).
15. Bankes, S. C. Agent-based modeling: a revolution? Proc. Natl Acad. Sci. USA https://doi.org/10.1073/pnas.072081299 (2002).
16. Tracy, M., Cerdá, M. & Keyes, K. M. Agent-based modeling in public health: current applications and future directions. Annu. Rev. Public Health 39, 77–94 (2018).
17. Bonabeau, E. Agent-based modeling: methods and techniques for simulating human systems. Proc. Natl Acad. Sci. USA 99, 7280–7287 (2002).
18. Fagnant, D. J. & Kockelman, K. M. The travel and environmental implications of shared autonomous vehicles, using agent-based model scenarios. Transp. Res. Part C Emerg. Technol. 40, 1–13 (2014).
19. Kaur, P. et al. A survey on simulators for testing self-driving cars. In Proc. 2021 Fourth International Conference on Connected and Autonomous Driving (MetroCAD) (IEEE, 2021).
20. Radhakrishnan, L. et al. A certified de-identification system for all clinical text documents for information extraction at scale. JAMIA Open 6, ooad045 (2023).
21. Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).
22. Zayyan, M. Objective structured clinical examination: the assessment of choice. Oman Med. J. 26, 219–222 (2011).
23. Tu, T. et al. Towards conversational diagnostic AI. Preprint at https://arxiv.org/abs/2401.05654 (2024).
24. Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. npj Digit. Med. 6, 1–10 (2023).
25. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
26. Shen, H. et al. MultiTurnCleanup: a benchmark for multi-turn spoken conversational transcript cleanup. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) 9895–9903 (ACL, 2023).
27. Chen, I. et al. Ethical machine learning in healthcare. Annu. Rev. Biomed. Data Sci. 4, 123–144 (2021).
28. Rebedea, T. et al. NeMo Guardrails: a toolkit for controllable and safe LLM applications with programmable rails. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (ACL, 2023).
29. Webster, P. Six ways large language models are changing healthcare. Nat. Med. 29, 2969–2971 (2023).

Author contributions
N.M., B.Y.M., E.R.A., A.J.B., and A.A. were involved in the conception of the paper and writing of the original draft. All authors were involved in the reviewing, revising, and editing of the final draft. All first co-authors made equal contributions.

Competing interests
A.J.B. is a co-founder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation, and in the recent past, Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, and Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson and Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease-specific foundations and associations, and health systems. A.J.B. receives royalty payments through Stanford University for several patents and other disclosures licensed to NuMedii and Personalis. A.J.B.'s research has been funded by NIH, Peraton (as the prime on an NIH contract), Genentech, Johnson and Johnson, FDA, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor's Office of Planning and Research, California Institute for Regenerative Medicine, L'Oreal, and Progenity. None of these entities had any bearing on the design of this study or the writing of the manuscript. All other authors have no conflicts of interest to disclose.

Additional information
Correspondence and requests for materials should be addressed to Ahmed Alaa.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024

