
npj | digital medicine Comment

Published in partnership with Seoul National University Bundang Hospital

https://doi.org/10.1038/s41746-024-01083-y

Evaluating large language models as agents in the clinic

Nikita Mehandru, Brenda Y. Miao, Eduardo Rodriguez Almaraz, Madhumita Sushil, Atul J. Butte & Ahmed Alaa

Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent "agents" that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model's ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as "Artificial Intelligence Structured Clinical Examinations" ("AI-SCE"), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial to deploying LLM agents in medical settings.

The release of ChatGPT, a chatbot powered by a large language model (LLM), has brought LLMs into the spotlight and unlocked opportunities for their use in healthcare settings. Med-PaLM 2, Google's medical LLM, was found to consistently perform at a human expert level on medical examination questions, scoring 85%1. While this model, part of Google's family of foundation models known as MedLM, is fine-tuned for the healthcare industry, even large LLMs trained on openly available information from the Internet, not just biomedical information, have immense potential to improve and augment clinical workflows2–4. For instance, the Generative Pre-trained Transformer-4 (GPT-4) model can generate summaries of physician–patient encounters from transcripts of conversations5, achieve a score of 86% on the United States Medical Licensing Examination (USMLE)6, and create clinical question–answer pairs that are largely indistinguishable from human-generated USMLE questions7. These early demonstrations of GPT-4 and other LLMs on clinical tasks and benchmarks suggest that these models have the potential to improve and automate aspects of clinical work.

However, the emergent capabilities of LLMs have significantly expanded their potential beyond conventional, standardized clinical natural language processing (NLP) tasks, which primarily revolve around text processing and question answering. Instead, there is a growing emphasis on utilizing LLMs for more complex physician- and patient-facing tasks that may involve multi-step information synthesis, use of external data sources, high-level reasoning, or even simulation of clinical text and conversations8,9. In these scenarios, LLMs should not be viewed as models of language, but rather as intelligent "agents" with internal planning capabilities that allow them to perform complex, multi-step reasoning or interact with tools, databases, other agents, or external users to better respond to user requests9,10. Here, we discuss how LLM agents can be used in clinical settings, and the challenges to the development and evaluation of these approaches.

Development of LLM agents for clinical use
LLM agents can be developed for a variety of clinical use cases by giving the LLM access to different sources of information and tools, including clinical guidelines, databases containing electronic health records, clinical calculators, or other curated clinical software tools9,10. These agents can respond to user requests by autonomously identifying and retrieving relevant information, or by performing multi-step analyses to answer questions, model data, or produce visualizations. Different agents can even interact and collaborate with each other in "multi-agent" settings to identify or check proposed solutions to difficult problems, or to model medical conversations and decision-making processes11.

Healthcare systems are already adopting LLMs capable of powering clinical agents; for instance, UC San Diego Health is working to integrate GPT-4 into MyChart, Epic's online health portal, to streamline patient messaging12. Patients also leverage publicly available chatbots (such as ChatGPT) to better understand medical vocabulary from clinical notes, and some medical centers are exploring a "virtual-first" approach in which LLMs assist in patient triaging13,14. When connected to additional sources of information and tools, the versatility and adaptability of clinical agents make them well suited to supporting both routine administrative tasks and clinical decision support.

Clinical simulations using agent-based modeling (ABM)
To evaluate the utility and safety of LLM-based chatbots as agents in these applications, we suggest the use of benchmarks that are not confined to traditional, narrowly scoped assessments based on NLP benchmarks, which consist of predetermined inputs and ground truths. Instead, approaches from agent-based modeling (ABM)15 can be used to create a simulated environment for effective evaluation of LLM agents. ABM is a computational framework that simulates the actions and interactions of autonomous agents to provide insights into system-level behavior and outcomes. This approach has been used in health policy, biology, and the social sciences to conduct studies that simulate health behaviors and the spread of infectious diseases16,17.
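The ABM evaluation loop described above (simulated patients interacting with an agent, with outcomes scored against known ground truth) can be sketched in a few lines of Python. This is a toy illustration, not a clinical system: the keyword rule inside TriageAgent.step stands in for a real LLM call, and the symptom scripts and acuity labels are invented for the example.

```python
# Minimal agent-based simulation sketch: scripted patient agents interact with
# a stand-in triage agent, and outcomes are scored against known ground truth.

class SimulatedPatient:
    """Replays a scripted presentation, one utterance per turn."""
    def __init__(self, symptoms, true_acuity):
        self.symptoms = list(symptoms)
        self.true_acuity = true_acuity  # ground truth, used only for scoring

    def respond(self):
        return self.symptoms.pop(0) if self.symptoms else None


class TriageAgent:
    """Placeholder for an LLM agent: a keyword rule keeps the sketch
    self-contained and deterministic; a real system would call a model here."""
    URGENT = {"chest pain", "shortness of breath"}

    def __init__(self):
        self.transcript = []  # intermediate steps, retained for evaluation

    def step(self, utterance):
        self.transcript.append(utterance)
        return "urgent" if utterance in self.URGENT else None  # None = keep listening


def run_encounter(patient, agent, max_turns=10):
    """Run one simulated encounter and return the agent's final disposition."""
    for _ in range(max_turns):
        utterance = patient.respond()
        if utterance is None:  # patient has nothing more to report
            return "routine"
        decision = agent.step(utterance)
        if decision is not None:
            return decision
    return "routine"


# Score the agent over a small population of invented scenarios.
scenarios = [
    (["headache", "chest pain"], "urgent"),
    (["sore throat", "cough"], "routine"),
    (["shortness of breath"], "urgent"),
]
correct = sum(
    run_encounter(SimulatedPatient(s, acuity), TriageAgent()) == acuity
    for s, acuity in scenarios
)
print(f"triage agreement: {correct}/{len(scenarios)}")  # prints "triage agreement: 3/3"
```

Because each agent retains its transcript, an evaluator can inspect the intermediate steps (which utterances the agent saw before deciding) rather than only the final disposition, which is the kind of process-level assessment that simulation-based evaluation enables.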

npj Digital Medicine | (2024)7:84



ABM has also been used to evaluate autonomous agents in the domain of self-driving cars18. In this field, simulations of real-world environments containing road obstacles, traffic signals, other cars, and pedestrians can be used to evaluate and refine the behaviors of autonomous vehicle agents as they encounter these different elements19. Similarly, by simulating the clinical settings where LLM agents may be deployed, including patient–physician interactions and hospital processes, we can use an ABM approach to evaluate how an LLM agent may interact with users, which tools or data an LLM employs to carry out user requests, and points of failure that lead to erroneous outputs or downstream errors.

Interestingly, patients and physicians can themselves be simulated as LLM agents in ABM environments. Previous research has demonstrated the feasibility of employing LLMs to create "interactive simulacra" that replicate human behavior9–11. To develop these high-fidelity simulations, data on physician and patient behavior can be derived from real-world electronic health records or clinical trial data, ideally with validation from multiple hospital systems and encompassing diverse patient populations. De-identified datasets (e.g., MIMIC-IV, UCSF Information Commons) or federated learning approaches can be used to help protect patient privacy20,21.

Evaluating agent-based simulations using an AI-SCE framework
Similar to standards and regulations for the autonomous driving industry, identifying robust clinical guidelines and what constitutes a successful interaction for healthcare LLM agents will be crucial to fulfilling the long-term goals of patients, providers, and other clinical stakeholders. In medical education, there has been a shift from assessing students with standardized tests, which evaluate shallow clinical reasoning, to modern curricula that increasingly use the Objective Structured Clinical Examination (OSCE)22. These exams assess a student's practical skills in the clinic, including the ability to examine patients, take clinical histories, communicate effectively, and handle unexpected situations. Google recently developed Articulate Medical Intelligence Explorer (AMIE), a research AI system for diagnostic medical reasoning and conversations, which was evaluated against the performance of primary care physicians (PCPs) in the style of an OSCE23.

Current benchmarks for clinical NLP, including MedQA (USMLE-style questions) and MedNLI (which tests whether two clinical statements logically follow from each other), are often derived from standardized tests or curated clinical text. This information, however, is not a sufficient metric because it fails to capture the full range of capabilities demonstrated by clinical LLM agents24,25. As a result, we call for the development of Artificial Intelligence Structured Clinical Examinations (AI-SCEs) that can be used to assess the ability of LLMs to aid in real-world clinical workflows. These AI-SCE benchmarks, which may be derived from difficult clinical scenarios or from real-world clinical tasks, should be created with input from interdisciplinary teams of clinicians, computer scientists, and medical researchers. OSCEs typically consist of long lists of processes or diagnoses on which students are graded. Similarly, AI-SCE benchmarks would extend beyond traditional computer science metrics, such as BLEU or ROUGE scores, that often do not account for semantic meaning, and would draw from preexisting multi-turn benchmarks26.

The AI-SCE format should be used to evaluate both the outputs of high-fidelity agent simulations and the intermediate steps that capture the agent's reasoning process, tool usage, data curation, or interactions with other agents or external users. Thus, a valuable contribution of these agents is their ability to provide interpretability throughout the decision-making process, as opposed to only at the final step27. These evaluations can also capture how the systematic addition or removal of LLM agents affects overall outcomes, and should be used to inform guardrails for clinical LLMs, which have been developed for general-purpose models to constrain their behavior28.

One added complexity of assessing agents using an AI-SCE format is the complicated nature of many clinical tasks, for which there may not be perfect concordance among individual human evaluators. We emphasize the continued need for a panel of human evaluators, and the importance of testing agent outcomes on external datasets. We also recognize the importance of post-deployment monitoring to ensure data distribution shifts do not occur over time, and to mitigate bias in model performance25. Furthermore, randomized controlled trials (RCTs) should be conducted to compare how well these simulation environments capture real-world settings, as well as the real-world impact of LLM agents in augmenting clinical workflows.

As LLMs evolve and demonstrate increasingly advanced capabilities, their involvement in clinical practice will extend beyond limited text processing tasks29. In the near future, it may become necessary to shift our benchmarks from static datasets to dynamic simulation environments and to transition from language modeling to agent modeling. Drawing inspiration from fields such as biology and economics could be beneficial for future LLM research and development for clinical applications.

Nikita Mehandru1,6, Brenda Y. Miao2,6, Eduardo Rodriguez Almaraz3,4,6, Madhumita Sushil2, Atul J. Butte2,5 & Ahmed Alaa1,2

1University of California, Berkeley, 2195 Hearst Ave, Warren Hall Suite 120C, Berkeley, CA, USA. 2Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA. 3Neurosurgery Department, Division of Neuro-Oncology, University of California San Francisco, 400 Parnassus Avenue, 8th Floor, RM A808, San Francisco, CA, USA. 4Department of Epidemiology and Biostatistics, University of California San Francisco, 400 Parnassus Avenue, 8th Floor, RM A808, San Francisco, CA, USA. 5Department of Pediatrics, University of California San Francisco, San Francisco, CA, USA. 6These authors contributed equally: Nikita Mehandru, Brenda Y. Miao, Eduardo Rodriguez Almaraz. e-mail: [email protected]

Received: 25 August 2023; Accepted: 22 March 2024

References
1. Singhal, K. et al. Towards expert-level medical question answering with large language models. Preprint at https://arxiv.org/abs/2305.09617 (2023).
2. Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1998–2022 (ACL, 2022).
3. Brown, T. B. et al. Language models are few-shot learners. In Proc. NeurIPS 2020 (2020).
4. Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).
5. Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 388, 1233–1239 (2023).
6. Fleming, S. L. et al. Assessing the potential of USMLE-like exam questions generated by GPT-4. Preprint at https://doi.org/10.1101/2023.04.25.23288588 (2023).
7. Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at https://doi.org/10.48550/arXiv.2303.13375 (2023).
8. Dash, D. et al. Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery. Preprint at https://doi.org/10.48550/arXiv.2304.13714 (2023).
9. Park, J. S. et al. Generative agents: interactive simulacra of human behavior. In Proc. 36th Symposium on User Interface Software and Technology (UIST) 1–22 (ACM, 2023).
10. Yang, H., Yue, S. & He, Y. Auto-GPT for online decision making: benchmarks and additional opinions. Preprint at https://doi.org/10.48550/arXiv.2306.02224 (2023).


11. Johri, S. et al. Testing the limits of language models: a conversational framework for medical AI assessment. Preprint at https://www.medrxiv.org/content/10.1101/2023.09.12.23295399v2 (2023).
12. Introducing Dr. Chatbot. https://today.ucsd.edu/story/introducing-dr-chatbot (2023).
13. Levine, D. M. et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model. Preprint at https://doi.org/10.1101/2023.01.30.23285067 (2023).
14. Korngiebel, D. M. & Mooney, S. D. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. npj Digit. Med. 4, 1–3 (2021).
15. Bankes, S. C. Agent-based modeling: a revolution? Proc. Natl Acad. Sci. USA https://doi.org/10.1073/pnas.072081299 (2002).
16. Tracy, M., Cerdá, M. & Keyes, K. M. Agent-based modeling in public health: current applications and future directions. Annu. Rev. Public Health 39, 77–94 (2018).
17. Bonabeau, E. Agent-based modeling: methods and techniques for simulating human systems. Proc. Natl Acad. Sci. USA 99, 7280–7287 (2002).
18. Fagnant, D. J. & Kockelman, K. M. The travel and environmental implications of shared autonomous vehicles, using agent-based model scenarios. Transp. Res. Part C Emerg. Technol. 40, 1–13 (2014).
19. Kaur, P. et al. A survey on simulators for testing self-driving cars. In Proc. 2021 Fourth International Conference on Connected and Autonomous Driving (MetroCAD) (IEEE, 2021).
20. Radhakrishnan, L. et al. A certified de-identification system for all clinical text documents for information extraction at scale. JAMIA Open 6, ooad045 (2023).
21. Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).
22. Zayyan, M. Objective structured clinical examination: the assessment of choice. Oman Med. J. 26, 219–222 (2011).
23. Tu, T. et al. Towards conversational diagnostic AI. Preprint at https://arxiv.org/abs/2401.05654 (2024).
24. Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. npj Digit. Med. 6, 1–10 (2023).
25. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
26. Shen, H. et al. MultiTurnCleanup: a benchmark for multi-turn spoken conversational transcript cleanup. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) 9895–9903 (ACL, 2023).
27. Chen, I. et al. Ethical machine learning in healthcare. Annu. Rev. Biomed. Data Sci. 4, 123–144 (2021).
28. Rebedea, T. et al. NeMo Guardrails: a toolkit for controllable and safe LLM applications with programmable rails. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (ACL, 2023).
29. Webster, P. Six ways large language models are changing healthcare. Nat. Med. 29, 2969–2971 (2023).

Author contributions
N.M., B.Y.M., E.R.A., A.J.B., and A.A. were involved in the conception of the paper and writing of the original draft. All authors were involved in the reviewing, revising, and editing of the final draft. All first co-authors made equal contributions.

Competing interests
A.J.B. is a co-founder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation, and in the recent past, Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, and Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson and Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease-specific foundations and associations, and health systems. A.J.B. receives royalty payments through Stanford University for several patents and other disclosures licensed to NuMedii and Personalis. A.J.B.'s research has been funded by NIH, Peraton (as the prime on an NIH contract), Genentech, Johnson and Johnson, FDA, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor's Office of Planning and Research, California Institute for Regenerative Medicine, L'Oreal, and Progenity. None of these entities had any bearing on the design of this study or the writing of the manuscript. All other authors have no conflicts of interest to disclose.

Additional information
Correspondence and requests for materials should be addressed to Ahmed Alaa.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024

