TriageAgent: Towards Better Multi-Agents Collaborations for Large Language Model-Based Clinical Triage
(menglu,xuanw)@vt.edu; (bho2,dmren2)@childrensnational.org
[Figure 2 input example] Clinical Document: "13-year-old male walks into the ED with his mother on a Friday night. Mom states, 'I didn't realize he was out of his medications for his ADHD, and I don't want him to miss a day.' The patient is cooperative and pleasant. VS: BP 108/72, HR 78, RR 14, T 98.6°F." The document flows through Step 1 (Allocating Documents), Step 2 (Group-Based Classification Analysis), Step 3 (Confidence Report Summarization), collaborative discussion, and a consensus report that yields the final answer.
Figure 2: A framework of our proposed heterogeneous multi-agent collaboration method, TRIAGEAGENT, illustrates the reasoning process through five stages when given a clinical document as input. The five stages include 1) allocating documents, 2) group-based classification analysis, 3) confidence report summarization, 4) collaborative discussion, and 5) consensus agreement.
ical summary generation (Shaib et al., 2023). Current research on LLMs in healthcare primarily focuses on integrating external tools to enhance clinical insights and refining models through instruction tuning. For instance, GeneGPT (Jin et al., 2023) leverages Web APIs from the National Center for Biotechnology Information (NCBI) to access diverse biomedical information and then employs GPT models for reasoning tasks. Additionally, the methods in (Zhang et al., 2024b; Singhal et al., 2022; Oniani et al., 2024; Kang et al., 2023) utilize instruction tuning combined with prompt design to adapt LLMs for various healthcare tasks, including decision support, medical question answering, and disease diagnosis.

2.2 LLM-driven Multi-Agents Collaboration

Research in both academia and industry has focused on autonomous agents trained in isolated, self-contained environments with limited knowledge bases (Wang et al., 2024a). Significant advancements have been achieved in deploying LLM-based agents capable of independent sensing and decision-making, as detailed in (Yao et al., 2023; Xie et al., 2023; Zhou et al., 2023). The trend has shifted towards collaborative multi-agent systems, which enhance the capabilities of LLM agents through iterative feedback and teamwork, as discussed in (Xi et al., 2023; Wang et al., 2024b; Li et al., 2023; Beigi et al., 2024). These systems emulate human learning and decision-making processes, involving agents assuming specific roles (Wang et al., 2024b; Hong et al., 2023) and engaging in effective communication (Qian et al., 2023; Wu et al., 2023; Li et al., 2023). Recent research has also explored improving agent performance through adversarial tactics such as debates (Du et al., 2023; Liang et al., 2023; Xiong et al., 2023) and negotiations (Fu et al., 2023), with innovative frameworks where agents interact competitively (Liang et al., 2023) or negotiate roles as buyers and sellers (Fu et al., 2023).

3 Methodology

This section presents the TRIAGEAGENT framework, which simulates teamwork and problem-solving in diagnosing and planning treatment for ED cases. We explore the heterogeneous structure formulation of the framework, as introduced in Appendix A, and describe the five stages of TRIAGEAGENT's operation, illustrated in Figure 2 and detailed in the following procedures. First, allocating documents, where patient clinical records are assigned to expert agents to initiate ESI discussions. Second, group-based classification analysis, where agents are divided into two groups to conduct coarse- and fine-grained classification. Third, confidence report summarization, where a summarized report is generated based on the previous analyses, including classification results and confidence scores. Fourth, collaborative discussion, where agents engage in discussions over the summarized report, iteratively refining it according to key information, rationales, and confidence scores. Fifth, consensus agreement, resulting in an ultimate, precise, and thoroughly validated revised report, highlighting the importance of collaborative decision-making.

3.1 Document Allocation

Given a patient's narrative clinical note P = {P1, P2, ...}, a clinical natural language query q, and a structured ESI handbook with level references R = {R1, R2, ...}, this stage assigns documents to agents and initiates expert discussions, as illustrated in Figure 2, Step 1. For more details about the query, see Appendix B.
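To make this notation concrete, the following minimal sketch shows one way the inputs of this stage could be represented; the class, field, and agent names are illustrative assumptions on our part and are not taken from the paper's released implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TriageInput:
    """Inputs to the document-allocation stage (Figure 2, Step 1)."""
    clinical_note: List[str]   # P = {P1, P2, ...}: segments of the patient's narrative note
    esi_references: List[str]  # R = {R1, R2, ...}: structured ESI handbook level references
    query: str                 # q: the clinical natural-language query (see Appendix B)

def allocate(doc: TriageInput, agent_names: List[str]) -> Dict[str, TriageInput]:
    """Hand the same document and query to every expert agent to open the ESI discussion."""
    return {name: doc for name in agent_names}

# Example with the three classification agents described in Section 3.2 (hypothetical names).
doc = TriageInput(
    clinical_note=["13-year-old male walks into the ED ..."],
    esi_references=["ESI Handbook v4, Chapter 2: ESI Triage Algorithm"],
    query="What is the ESI level of the following clinic record? ...",
)
assignments = allocate(doc, ["Doctor_A", "Doctor_B", "Doctor_C"])
```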
3.2 Group-Based Document Classification

The classification stage is summarized as a function f : (P, R, q) → C, where C represents the set of hierarchically structured ESI-level categories. To improve classification efficiency and accuracy, agents are divided into two groups, employing a coarse-to-fine-grained and a direct classification, respectively.

Coarse-to-Fine-Grained Classification This group comprises two agents. The first agent, A1, performs an initial coarse classification of the patient record Pi into two broad categories: high (levels 1, 2, and 3) or low (levels 3, 4, and 5), with level 3 included in both. This can be represented as A1 : Pi → {C_high, C_low}. The second agent, A2, then refines these broad categories into detailed ESI levels: if classified as C_high, the second agent selects from {1, 2, 3}; if C_low, it selects from {3, 4, 5}. This can be represented as A2 : {C_high, C_low} → {{1, 2, 3}, {3, 4, 5}}. This two-step process reduces misclassifications and enhances precision. As illustrated in the left of Step 2 in Figure 2, the first agent, A1 (referred to as Doctor A), initially assigns a high ESI level with 90% confidence. A2 (referred to as Doctor B) then refines the high-level category to ESI level 2 with an 80% confidence score. After rounds of discussion, Doctor B revises the classification to ESI level 1, achieving a revised confidence score of 95%.

Direct Fine-Grained Classification This group consists of a single agent, A3, who directly assigns ESI levels: A3 : Pi → {1, 2, 3, 4, 5}. As illustrated in the right of Step 2 in Figure 2, A3 (referred to as Doctor C) initially assigns ESI level 2 with 90% confidence and later refines the classification to level 1 with 95% confidence after further consideration.

By combining the two strategies described above, TRIAGEAGENT effectively addresses the challenge of distinguishing boundaries between ESI levels and enhances decision-making efficiency and accuracy through this collaborative effort.
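The two groups can be sketched as follows; `ask_agent` is a placeholder for a call to the underlying LLM agent (not an actual library function), and the level sets mirror the mapping described above.

```python
# Illustrative sketch of the two classification groups in Section 3.2.
HIGH, LOW = "C_high", "C_low"                 # coarse categories; level 3 belongs to both
FINE_LEVELS = {HIGH: [1, 2, 3], LOW: [3, 4, 5]}

def coarse_to_fine(record, ask_agent):
    """Group 1: A1 picks a coarse category, then A2 refines it to one ESI level."""
    coarse, conf_a1 = ask_agent("A1_Doctor_A", record, choices=[HIGH, LOW])         # e.g., C_high, 90%
    level, conf_a2 = ask_agent("A2_Doctor_B", record, choices=FINE_LEVELS[coarse])  # e.g., ESI 2, 80%
    return level, conf_a2

def direct_fine(record, ask_agent):
    """Group 2: A3 assigns one of the five ESI levels in a single step."""
    return ask_agent("A3_Doctor_C", record, choices=[1, 2, 3, 4, 5])
```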
3.3 Confidence Report Summarization

In this stage, the summarizer agent A_s summarizes the previous document classification results, including confidence scores, rationales, and supporting evidence from (A1, A2, A3). This step consolidates the findings and uses each agent's analysis report to construct the summary prompt Prompt_rs, ensuring a well-supported and reliable decision. Additionally, debates among the agents are incorporated into the summarizer's prompt. The summarizer then generates a synthesized report by extracting key information and analyzing the previous classifications provided by the agents. This process can be mathematically formulated as Repo = LLM(P, R, r_rs, Prompt_rs), where Repo represents the synthesized report, P denotes the patient's clinical notes, R refers to the ESI handbook references, r_rs is the role of the summarizer, and Prompt_rs is the guideline prompt for the summarizer, including the analysis reports from (A1, A2, A3). The synthesized report is structured as Repo = [key information; confidence score; rationale; consolidated analysis]. This approach effectively combines insights from multiple agents, ensuring that the triage decision is based on comprehensive and validated information. Consequently, this method enhances the accuracy, interpretability, and reliability of the clinical triage process. As depicted in Step 3 of Figure 2, the key information includes references from the ESI Handbook (e.g., ESI Handbook v4, Chapter 2: ESI Triage Algorithm, p. 10-13). The summarized report consolidates the rationales and the overall analysis, ensuring that all relevant information is considered.
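A minimal sketch of how Prompt_rs and the synthesized report Repo could be assembled is shown below; the prompt wording and field names are ours and only mirror the structure described in the text, and `llm` stands in for the summarizer agent's model call.

```python
def build_summary_prompt(notes, references, agent_reports, role="result_summarizer"):
    """Compose Prompt_rs from the patient notes P, ESI references R, and the agents' analyses."""
    analyses = "\n".join(
        f"{r['agent']}: level={r['level']}, confidence={r['confidence']}%, rationale={r['rationale']}"
        for r in agent_reports
    )
    return (
        f"Role: {role}\n"
        f"Patient notes: {notes}\n"
        f"ESI references: {references}\n"
        f"Agent analyses:\n{analyses}\n"
        "Synthesize a report structured as "
        "[key information; confidence score; rationale; consolidated analysis]."
    )

def summarize(llm, notes, references, agent_reports):
    # Repo = LLM(P, R, r_rs, Prompt_rs)
    return llm(build_summary_prompt(notes, references, agent_reports))
```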
3.4 Collaborative Discussion

In this stage, agents engage in multiple rounds of discussion based on the synthesized summary report Repo to refine their individual classifications. Unlike the commonly used voting mechanism (Tang et al., 2024), TRIAGEAGENT critically reflects on the classification results, reasoning, and confidence scores, incorporating peer-provided evidence. Each agent Ai starts with an initial classification result Ci and confidence score Si and follows the process below: Ai reviews the classification results Cj, reasoning Rj, and confidence scores Sj from every other agent Aj (j ≠ i). If Ai is persuaded by Aj's reasoning or finds Sj > Si, it updates Ci to Cj with an explanation. Conversely, if Ai rejects Aj's reasoning or finds Sj lower than or equal to Si, it justifies keeping Ci. This iterative process continues until the agents reach a preliminary consensus or the early-stopping mechanism is triggered. In Step 4 of Figure 2, agents participate in collaborative discussions to resolve discrepancies and refine the report.
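The update rule and early-stopping behavior can be sketched as below. For brevity, persuasion is modeled purely by the confidence comparison Sj > Si; in the actual framework an agent may also adopt a peer's answer because of its reasoning, or justify keeping its own.

```python
def collaborative_discussion(agents, max_rounds=12):
    """agents: list of dicts holding each agent's current label (C_i) and confidence (S_i)."""
    for _ in range(max_rounds):                        # early-stopping round limit
        changed = False
        for a_i in agents:
            for a_j in agents:
                if a_j is a_i:
                    continue
                if a_j["conf"] > a_i["conf"]:          # S_j > S_i: adopt C_j (with explanation)
                    a_i["label"], a_i["conf"] = a_j["label"], a_j["conf"]
                    changed = True
                # otherwise a_i justifies keeping C_i
        if len({a["label"] for a in agents}) == 1:     # preliminary consensus reached
            break
        if not changed:                                # no agent updated its answer; stop early
            break
    return agents
```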
3.5 Consensus Agreement

In this final stage, the agents converge on a single classification, leveraging their combined domain knowledge to validate the final decision. This collaborative process ensures the final decision is robust and well-supported by comprehensive analysis. As shown in Step 5 of Figure 2, the final consensus is reached and the definitive classification is provided after all agents agree on the outcome.

4 Experimental Setup

Table 1: Statistics of the clinical triage dataset
Dataset     Training   Test-1   Test-2   Test-3
# of Docs   218        72       72       72

Implementation We use GPT-3.5-Turbo (OpenAI, 2024) and GPT-4 (OpenAI et al., 2024) from OpenAI as our base models for the zero-shot experiments. Our TRIAGEAGENT framework utilizes the publicly open-sourced AutoGen framework from Microsoft (https://microsoft.github.io/autogen/). The temperature is 0.9, top-k is 1.0, and the cache seed is 42. The maximum number of iterations is 12, and the frequency penalty is 0.1. For SCtr (self-contrast) and SCons (self-consistency), we perform 8 iterations with a temperature of 0.9.
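As a rough illustration of how these settings could be wired up with the AutoGen group-chat API, consider the sketch below. The agent roles, system messages, and exact configuration keys are assumptions on our part; they may differ across AutoGen releases and from the authors' released code.

```python
import autogen

llm_config = {
    "config_list": [{"model": "gpt-3.5-turbo", "api_key": "YOUR_KEY"}],
    "temperature": 0.9,   # reported sampling temperature
    "cache_seed": 42,     # reported cache seed
}

doctor_a = autogen.AssistantAgent("Doctor_A", system_message="Coarsely classify the record as high or low ESI.", llm_config=llm_config)
doctor_b = autogen.AssistantAgent("Doctor_B", system_message="Refine the coarse category to one ESI level.", llm_config=llm_config)
doctor_c = autogen.AssistantAgent("Doctor_C", system_message="Directly assign an ESI level from 1-5.", llm_config=llm_config)
summarizer = autogen.AssistantAgent("Result_summarizer", system_message="Synthesize the confidence report.", llm_config=llm_config)
admin = autogen.UserProxyAgent("Admin", human_input_mode="NEVER", code_execution_config=False)

group = autogen.GroupChat(agents=[admin, doctor_a, doctor_b, doctor_c, summarizer],
                          messages=[], max_round=12)   # reported maximum number of iterations
manager = autogen.GroupChatManager(groupchat=group, llm_config=llm_config)
# admin.initiate_chat(manager, message=query + clinical_record)
```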
Baselines The performance of the TRIAGEAGENT framework is evaluated against several state-of-the-art baselines, including methods that employ LLM-based planning, tool usage, and retrieval-augmented generation.

• Chain-of-thought (Kojima et al., 2023) integrates step-by-step reasoning into the prompt for LLMs. We implemented CoT on our clinical triage dataset as a baseline method for comparison.

• Self-contrast (Zhang et al., 2024a) improves stability and accuracy by contrasting different solving perspectives and summarizing discrepancies. We applied self-contrast on our clinical triage dataset to analyze and reconcile conflicting classification results as a baseline for comparison.

• Self-consistency (Wang et al., 2023b) enhances zero-shot and few-shot CoT by generating predominant responses through multiple chain samplings. We implemented self-consistency on our clinical triage dataset to generate multiple response chains and select the most consistent answers as a baseline for comparison.

• Exchange-of-thought (Yin et al., 2023) enables cross-model communication and problem-solving integration. We implemented EoT on our clinical triage dataset to facilitate communication between various agents as a baseline for comparison.

• MedAgent (Tang et al., 2024) is a role-playing collaboration framework for medical scenarios using LLMs. We utilized MedAgent's multi-agent framework on our clinical triage dataset as a baseline for comparison.

• TRIAGEAGENT utilizes dynamically updated confidence scores from various reasoning perspectives, supported by external evidence, to enhance performance. It selects the top K most confident answers for critical assessment. Agents employ retrieval-augmented generation to refine their choices, ensuring minimal discordance and maximal coherence. This iterative process integrates multiple methodologies and specialized knowledge retrieval to improve decision accuracy and reliability.
Evaluation Protocol Our primary evaluation metric is the total discordance rate, which measures the percentage of incorrectly predicted queries. This metric is critical as it provides a comprehensive overview of accuracy; a lower value indicates better performance. Additionally, we evaluate the undertriage rate, overtriage rate, significant undertriage rate, and significant overtriage rate to assess the model's performance in specific areas of clinical urgency categorization. While these metrics are important, the total discordance rate remains the primary measure of accuracy. Detailed definitions of the five ESI levels (I-V) and the evaluation metrics are provided in Appendix D.
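For reference, a simple way to compute these rates is sketched below. The exact level differences counted as "significant" under- or overtriage are our assumption here; Appendix D gives the paper's precise definitions.

```python
def triage_metrics(true_levels, pred_levels):
    """ESI levels run from 1 (most urgent) to 5 (least urgent)."""
    n = len(true_levels)
    pairs = list(zip(pred_levels, true_levels))
    return {
        "total_discordance":       sum(p != t for p, t in pairs) / n,
        "undertriage":             sum(p > t for p, t in pairs) / n,       # predicted less urgent than truth
        "overtriage":              sum(p < t for p, t in pairs) / n,       # predicted more urgent than truth
        "significant_undertriage": sum(p - t >= 2 for p, t in pairs) / n,  # assumed threshold
        "significant_overtriage":  sum(t - p >= 2 for p, t in pairs) / n,  # assumed threshold
    }

print(triage_metrics([1, 2, 3, 4, 5], [1, 3, 3, 2, 5]))
# {'total_discordance': 0.4, 'undertriage': 0.2, 'overtriage': 0.2,
#  'significant_undertriage': 0.0, 'significant_overtriage': 0.2}
```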
5 Results

5.1 Main Results

We evaluate the performance of TRIAGEAGENT by averaging the results across the three test sets. The backbone LLMs in our experiments include GPT-3.5, GPT-4, Llama-2, and Llama-3. Table 2 presents the main results, with better performance from the GPT-3.5 and GPT-4 models. Detailed results for Llama-2-7B and Llama-3-8B are provided in Appendix E.

Performance Comparison with Baselines The performance of our framework compared to state-of-the-art (SOTA) methods is presented in Table 2. The TRIAGEAGENT framework surpasses traditional prompt engineering methods, including supervised learning, self-contrast, and self-consistency, with improvements of 9.25%, 8.79%, and 5.55%, respectively. It also outperforms SOTA multi-agent frameworks in a zero-shot setting using the GPT-3.5-turbo model, exceeding MedAgents by 9.02% and EoT by 6.25%. Table 2 outlines the three variations of our TRIAGEAGENT framework.

Comparison with CoT Methods Performance can decline when employing overly complex Chain-of-Thought (CoT) methods. Simply stacking prompts without a clear, logical sequence can result in hallucinations: erroneous outputs caused by insufficient document comprehension and misunderstanding of medical terminology. However, our approach, which integrates multi-agent role-playing with confidence assessments, effectively addresses these issues and demonstrates its potential as a more robust method for applying LLMs in clinical triage.

Comparison with Single-Agent Methods Methods such as CoT, self-contrast, and supervised learning lack crucial interactions among multiple LLMs. This absence inhibits these methods from adaptively refining their responses, leading to suboptimal performance in triage question-answering scenarios. Consequently, their discordance rates average around 38.95% on our clinical triage dataset when using GPT-3.5 (as illustrated in Table 2), highlighting the need for improvement.

5.2 Ablation Study

Our ablation study analyzes team optimization and external resource optimization. After determining the optimal structure, the TRIAGEAGENT framework simulates multi-role team collaboration, enabling agents to acquire the necessary capabilities to effectively accomplish triage tasks.

Team Optimization Our heterogeneous framework employs multiple agents, each with a specific role, to achieve optimal outcomes. We explored various configurations to enhance team performance. By adjusting the number of role-specific agents, we found that four agents provided the most optimized structure for performance on the ESI triage dataset. Figure 3(a) details this optimal configuration and demonstrates how adjusting agent roles and numbers enhances overall system performance.
Figure 3: Ablation study results. (a), (b), and (c) show the impact of agent structure composition, optimizing the agent team with the ESI handbook, and optimizing the agent team with various external resources, respectively. A lower Total Discordance value in this figure represents better performance.
External Resource Optimization We evaluate the ESI Handbook, PubMed, and Wikipedia for supervising our model (Figure 3(c)). The ESI Handbook proved to be the most effective, significantly enhancing model performance with its targeted clinical diagnostic guidelines. In contrast, PubMed and Wikipedia require processing extensive additional data. We also analyze the optimal number of agents using the ESI Handbook to maximize task efficiency (Figure 3(b)). Increasing the number of agents improved model effectiveness without significantly changing overall accuracy, enhancing the contextual understanding and interpretability of medical texts while optimizing resource utilization.

5.3 Case Study
Error Analysis Based on our findings, we conduct an expert evaluation to identify key limitations and common issues in our model. As shown in Figure 4, we categorize these errors into four major types. The first type is lack of document understanding. This type of error arises from insufficient medical knowledge or incorrect linking to ESI levels, leading to misjudgments of clinical severity. The second type is mis-retrieval of domain knowledge: errors that result from inaccurately retrieving irrelevant or mismatched information, which compromises triage accuracy. The third type is confidence-based consistency errors. These errors are caused by confidently providing contradictory responses or failing to reach a consensus, often due to internal inefficiencies or flaws in the early-stopping mechanism, resulting in falsely assured incorrect outcomes. The last type is exchange-of-information errors, which result from incorrect data transfer between agents, disrupting logical sequences and leading to erroneous conclusions. See Appendix F for more details.

Time Analysis Time efficiency is crucial in emergency clinical triage. Table 3 compares the performance and time costs of different methods. Our multi-agent architecture incurs a slight but negligible increase in time costs compared to a single-agent model. Feedback from emergency department experts indicates that our method does not significantly affect overall time efficiency but greatly enhances decision accuracy, which is crucial for better patient outcomes and resource allocation.

Table 3: Time efficiency on multiple agents (average seconds per test case)
Model                        Test-1   Test-2   Test-3
CoT (GPT-3.5)                0.17     0.19     0.17
CoT (GPT-4)                  0.16     0.19     0.17
Self-Consistency (GPT-3.5)   0.21     0.23     0.24
Three-agents (GPT-3.5)       0.55     1.01     0.58
Three-agents (GPT-4)         0.57     1.00     0.59
Four-agents (GPT-3.5)        1.28     1.36     1.41
Four-agents (GPT-4)          1.31     1.56     1.53
Five-agents (GPT-3.5)        1.55     1.56     1.49
Six-agents (GPT-3.5)         2.11     2.05     2.13
TriageAgent (GPT-3.5)        1.30     1.45     1.52
TriageAgent (GPT-4)          1.31     1.43     1.50

Cost Analysis Operational efficiency, particularly API token usage, is crucial for our framework. Table 4 compares the performance and costs of different configurations. While TRIAGEAGENT requires more API calls than a single-agent setup, feedback from clinical departments suggests these costs are reasonable. Our system improves efficiency by 12.54% compared to human experts manually classifying documents and achieves 18.54% higher performance than LLMs using CoT prompts operated by human experts.

Table 4: Cost efficiency on ESI dataset classification
Model                   Performance   #API Calls
Two-agent (GPT-3.5)     38.42%        324
Three-agent (GPT-3.5)   38.42%        486
Four-agents (GPT-3.5)   30.56%        648
Four-agents (GPT-4)     18.98%        604
CoT (GPT-4)             37.40%        216
Figure 4: Ratio of different categories of error cases. (Pie-chart legend: Document Understanding Errors, Mis-retrieval of Domain Knowledge, Confidence-based Consistency Errors, and Exchange of Information Errors; segment values of 43%, 31%, 15%, and 11% appear in the chart.)

6 Conclusions

This paper introduces a novel heterogeneous multi-agent framework, TRIAGEAGENT, for clinical triage, utilizing LLM-based role-playing agents in a multi-stage group-chat setting. This zero-shot, training-free, and interpretable framework comprises five significant stages. Experiments on clinical triage datasets demonstrate that our framework significantly outperforms zero-shot baselines and experienced professionals. Case studies and human evaluations highlight areas for improvement, such as reducing document understanding errors and knowledge mis-retrieval. Future research can enhance the framework's efficiency by improving document comprehension and correcting domain knowledge retrieval errors. Upon acceptance, we will release our dataset as open source.

Limitations

In this paper, we introduce a heterogeneous multi-agent collaboration framework called TRIAGEAGENT. Despite our efforts, the framework faces limitations inherent to the healthcare industry.

Limited Expert Evaluation Our research is limited by the involvement of three human experts, restricting the scope and depth of expert evaluation, which may impact the generalizability of our findings to broader clinical settings.

Cross-Institution Collaboration The complexity of emergency departments requires advanced triage systems to address patient conditions within the same urgency levels. This need arises from varying institutional conditions, protocols, and patient demographics.

Workflow Integration Our model provides a final triage decision, but actual triage often involves multiple decision-making stages. Thus, it is crucial to evaluate how well our system integrates into existing workflows and complements human-led emergency care.

Privacy Deploying our framework necessitates strict privacy measures and clinical-worker training. Processing clinical notes can expose sensitive information, making compliance with HIPAA and GDPR essential.

Time and Cost Efficiency TRIAGEAGENT aims to enhance clinical triage by automating initial patient assessment, reducing manual workload, waiting times, burnout, and operational costs. However, using external technologies such as OpenAI's API adds expenses and dependencies. A detailed cost-benefit analysis is essential for sustainability and economic viability.

Acknowledgement

Our work is sponsored by the NSF NAIRR Pilot and PSC Neocortex, Commonwealth Cyber Initiative, Children's National Hospital, Fralin Biomedical Research Institute (Virginia Tech), Sanghani Center for AI and Data Analytics (Virginia Tech), Virginia Tech Innovation Campus, and a generous gift from the Amazon + Virginia Tech Center for Efficient and Robust Machine Learning.

Ethics Statement
Jin, Chang-Tien Lu, and Lifu Huang. 2024. Inter- Minki Kang, Seanie Lee, Jinheon Baek, Kenji
nalinspector i2 : Robust confidence estimation in llms Kawaguchi, and Sung Ju Hwang. 2023. Knowledge-
through internal states. augmented reasoning distillation for small language
models in knowledge-intensive tasks.
Zhenyu Bi, Sajib Acharjee Dip, Daniel Hajialigol, Sind-
hura Kommu, Hanwen Liu, Meng Lu, and Xuan Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu-
Wang. 2024. Ai for biomedicine in the era of large taka Matsuo, and Yusuke Iwasawa. 2023. Large lan-
language models. guage models are zero-shot reasoners.
Miguel Castro and Barbara Liskov. 1999. Practical Yuan Li, Yixuan Zhang, and Lichao Sun. 2023. Metaa-
byzantine fault tolerance. In Proceedings of the Third gents: Simulating interactions of human behaviors
Symposium on Operating Systems Design and Imple- for llm-based task-oriented coordination via collabo-
mentation, pages 173–186, USA. USENIX Associa- rative generative agents.
tion. Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang,
Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and
G Chenais, E Lagarde, and C Gil-Jardiné. 2023. Artifi- Shuming Shi. 2023. Encouraging divergent thinking
cial intelligence in emergency medicine: Viewpoint in large language models through multi-agent debate.
of current applications and foreseeable opportunities
and challenges. J Med Internet Res, 25:e40031. Chen Ling, Xujiang Zhao, Jiaying Lu, Chengyuan Deng,
Can Zheng, Junxiang Wang, Tanmoy Chowdhury,
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Yun Li, Hejie Cui, Xuchao Zhang, Tianjiao Zhao,
Maarten Bosma, Gaurav Mishra, Adam Roberts, Amit Panalkar, Dhagash Mehta, Stefano Pasquali,
Paul Barham, Hyung Won Chung, Charles Sutton, Wei Cheng, Haoyu Wang, Yanchi Liu, Zhengzhang
Sebastian Gehrmann, Parker Schuh, Kensen Shi, Chen, Haifeng Chen, Chris White, Quanquan Gu,
Sasha Tsvyashchenko, Joshua Maynez, Abhishek Jian Pei, Carl Yang, and Liang Zhao. 2024. Domain
Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin- specialization as the key to make large language mod-
odkumar Prabhakaran, Emily Reif, Nan Du, Ben els disruptive: A comprehensive survey.
Hutchinson, Reiner Pope, James Bradbury, Jacob
Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Zhengliang Liu, Zihao Wu, Mengxuan Hu, Bokai Zhao,
Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Lin Zhao, Tianyi Zhang, Haixing Dai, Xianyan Chen,
Sunipa Dev, Henryk Michalewski, Xavier Garcia, Ye Shen, Sheng Li, Brian Murray, Tianming Liu, and
Vedant Misra, Kevin Robinson, Liam Fedus, Denny Andrea Sikora. 2023. Pharmacygpt: The ai pharma-
Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, cist.
Barret Zoph, Alexander Spiridonov, Ryan Sepassi,
David Dohan, Shivani Agrawal, Mark Omernick, An- Danqing Ma, Meng Wang, Ao Xiang, Zongqing Qi, and
drew M. Dai, Thanumalayan Sankaranarayana Pil- Qin Yang. 2024. Transformer-based classification
lai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, outcome prediction for multimodal stroke treatment.
Rewon Child, Oleksandr Polozov, Katherine Lee,
Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Harsha Nori, Nicholas King, Scott Mayer McKinney,
Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Dean Carignan, and Eric Horvitz. 2023. Capabilities
Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, of gpt-4 on medical challenge problems.
and Noah Fiedel. 2022. Palm: Scaling language mod- David Oniani, Xizhi Wu, Shyam Visweswaran, Sumit
eling with pathways. Kapoor, Shravan Kooragayalu, Katelyn Polanska,
and Yanshan Wang. 2024. Enhancing large language
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. models for clinical decision support by incorporating
Tenenbaum, and Igor Mordatch. 2023. Improving clinical practice guidelines.
factuality and reasoning in language models through
multiagent debate. OpenAI. 2024. Gpt-3.5 turbo documentation.
https://platform.openai.com/docs/models/
Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. gpt-3-5-turbo. Accessed: 2024-06-15.
2023. Improving language model negotiation with
self-play and in-context learning from ai feedback. OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal,
Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale-
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu man, Diogo Almeida, Janko Altenschmidt, Sam Alt-
Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, man, Shyamal Anadkat, Red Avila, Igor Babuschkin,
Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Suchir Balaji, Valerie Balcom, Paul Baltescu, Haim-
Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, ing Bao, Mohammad Bavarian, Jeff Belgum, Ir-
and Jürgen Schmidhuber. 2023. Metagpt: Meta pro- wan Bello, Jake Berdine, Gabriel Bernadett-Shapiro,
gramming for a multi-agent collaborative framework. Christopher Berner, Lenny Bogdonoff, Oleg Boiko,
Madelaine Boyd, Anna-Luisa Brakman, Greg Brock-
Qiao Jin, Yifan Yang, Qingyu Chen, and Zhiyong Lu. man, Tim Brooks, Miles Brundage, Kevin Button,
2023. Genegpt: Augmenting large language models Trevor Cai, Rosie Campbell, Andrew Cann, Brittany
with domain tools for improved access to biomedical Carey, Chelsea Carlson, Rory Carmichael, Brooke
information. Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully
Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chelsea Voss, Carroll Wainwright, Justin Jay Wang,
Chess, Chester Cho, Casey Chu, Hyung Won Chung, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei,
Dave Cummings, Jeremiah Currier, Yunxing Dai, CJ Weinmann, Akila Welihinda, Peter Welinder, Ji-
Cory Decareaux, Thomas Degry, Noah Deutsch, ayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner,
Damien Deville, Arka Dhar, David Dohan, Steve Clemens Winter, Samuel Wolrich, Hannah Wong,
Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Lauren Workman, Sherwin Wu, Jeff Wu, Michael
Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qim-
Simón Posada Fishman, Juston Forte, Isabella Ful- ing Yuan, Wojciech Zaremba, Rowan Zellers, Chong
ford, Leo Gao, Elie Georges, Christian Gibson, Vik Zhang, Marvin Zhang, Shengjia Zhao, Tianhao
Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo- Zheng, Juntang Zhuang, William Zhuk, and Barret
Lopes, Jonathan Gordon, Morgan Grafstein, Scott Zoph. 2024. Gpt-4 technical report.
Gray, Ryan Greene, Joshua Gross, Shixiang Shane
Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-
Yuchen He, Mike Heaton, Johannes Heidecke, Chris roll L. Wainwright, Pamela Mishkin, Chong Zhang,
Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Sandhini Agarwal, Katarina Slama, Alex Ray, John
Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Schulman, Jacob Hilton, Fraser Kelton, Luke Miller,
Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Maddie Simens, Amanda Askell, Peter Welinder,
Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Paul Christiano, Jan Leike, and Ryan Lowe. 2022.
Jin, Denny Jin, Shino Jomoto, Billie Jonn, Hee- Training language models to follow instructions with
woo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Ka- human feedback.
mali, Ingmar Kanitscheider, Nitish Shirish Keskar,
Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize
Christina Kim, Yongjik Kim, Jan Hendrik Kirch- Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan
ner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023.
Łukasz Kondraciuk, Andrew Kondrich, Aris Kon- Communicative agents for software development.
stantinidis, Kyle Kosic, Gretchen Krueger, Vishal
Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen,
Leike, Jade Leung, Daniel Levy, Chak Ming Li, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang,
Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su,
Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun
Anna Makanju, Kim Malfacini, Sam Manning, Todor Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen
Markov, Yaniv Markovski, Bianca Martin, Katie Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi,
Mayer, Andrew Mayne, Bob McGrew, Scott Mayer Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong,
McKinney, Christine McLeavey, Paul McMillan, Yaxi Lu, Weilin Zhao, Yuxiang Huang, Junxi Yan,
Jake McNeil, David Medina, Aalok Mehta, Jacob Xu Han, Xian Sun, Dahai Li, Jason Phang, Cheng
Menick, Luke Metz, Andrey Mishchenko, Pamela Yang, Tongshuang Wu, Heng Ji, Zhiyuan Liu, and
Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Maosong Sun. 2023. Tool learning with foundation
Mossing, Tong Mu, Mira Murati, Oleg Murk, David models.
Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak,
Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Chantal Shaib, Millicent L. Li, Sebastian Joseph, Iain J.
Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Marshall, Junyi Jessy Li, and Byron C. Wallace. 2023.
Paino, Joe Palermo, Ashley Pantuliano, Giambat- Summarizing, simplifying, and synthesizing medical
tista Parascandolo, Joel Parish, Emy Parparita, Alex evidence using gpt-3 (with varying success).
Passos, Mikhail Pavlov, Andrew Peng, Adam Perel-
man, Filipe de Avila Belbute Peres, Michael Petrov, Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mah-
Henrique Ponde de Oliveira Pinto, Michael, Poko- davi, Jason Wei, Hyung Won Chung, Nathan Scales,
rny, Michelle Pokrass, Vitchyr H. Pong, Tolly Pow- Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl,
ell, Alethea Power, Boris Power, Elizabeth Proehl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris
Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Kelly, Nathaneal Scharli, Aakanksha Chowdhery,
Cameron Raymond, Francis Real, Kendra Rimbach, Philip Mansfield, Blaise Aguera y Arcas, Dale Web-
Carl Ross, Bob Rotsted, Henri Roussez, Nick Ry- ster, Greg S. Corrado, Yossi Matias, Katherine Chou,
der, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Ra-
Girish Sastry, Heather Schmidt, David Schnurr, John jkomar, Joelle Barral, Christopher Semturs, Alan
Schulman, Daniel Selsam, Kyla Sheppard, Toki Karthikesalingam, and Vivek Natarajan. 2022. Large
Sherbakov, Jessica Shieh, Sarah Shoker, Pranav language models encode clinical knowledge.
Shyam, Szymon Sidor, Eric Sigler, Maddie Simens,
Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai,
Sokolowsky, Yang Song, Natalie Staudacher, Fe- and Chao Zhang. 2023. Adaplanner: Adaptive plan-
lipe Petroski Such, Natalie Summers, Ilya Sutskever, ning from feedback with language models.
Jie Tang, Nikolas Tezak, Madeleine B. Thompson,
Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Rocío Sánchez-Salmerón et al. 2022. Machine learning
Preston Tuggle, Nick Turley, Jerry Tworek, Juan Fe- methods applied to triage in emergency services: A
lipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, systematic review. International Emergency Nursing,
60:101109.
Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan
Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Zheng, Xipeng Qiu, Xuanjing Huang, and Tao Gui.
Mark Gerstein. 2024. Medagents: Large language 2023. The rise and potential of large language model
models as collaborators for zero-shot medical reason- based agents: A survey.
ing.
Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Lu-
Hugo Touvron, Louis Martin, Kevin Stone, Peter Al- oxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao,
bert, Amjad Almahairi, Yasmine Babaei, Nikolay Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Su, Dongchan Shin, Caiming Xiong, and Tao Yu.
Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton 2023. Openagents: An open platform for language
Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, agents in the wild.
Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller,
Cynthia Gao, Vedanuj Goswami, Naman Goyal, An- Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing
thony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Qin. 2023. Examining inter-consistency of large lan-
Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, guage models collaboration: An in-depth analysis via
Isabel Kloumann, Artem Korenev, Punit Singh Koura, debate.
Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Di-
ana Liskovich, Yinghai Lu, Yuning Mao, Xavier Mar- LH Yao, KC Leung, CL Tsai, CH Huang, and LC Fu.
tinet, Todor Mihaylov, Pushkar Mishra, Igor Moly- 2021. A novel deep learning-based system for triage
bog, Yixin Nie, Andrew Poulton, Jeremy Reizen- in the emergency department using electronic medi-
stein, Rashi Rungta, Kalyan Saladi, Alan Schelten, cal records: Retrospective cohort study. J Med Inter-
Ruan Silva, Eric Michael Smith, Ranjan Subrama- net Res, 23(12):e27008.
nian, Xiaoqing Ellen Tan, Binh Tang, Ross Tay- Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak
lor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Shafran, Karthik Narasimhan, and Yuan Cao. 2023.
Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, React: Synergizing reasoning and acting in language
Melanie Kambadur, Sharan Narang, Aurelien Ro- models.
driguez, Robert Stojnic, Sergey Edunov, and Thomas
Scialom. 2023. Llama 2: Open foundation and fine- Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng
tuned chat models. Guo, Junqi Dai, Xuanjing Huang, and Xipeng Qiu.
2023. Exchange-of-thought: Enhancing large lan-
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao guage model capabilities through cross-model com-
Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, munication.
Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei,
and Ji-Rong Wen. 2024a. A survey on large language Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying
model based autonomous agents. Peng, Jun Wang, Yueting Zhuang, and Weiming Lu.
2024a. Self-contrast: Better reflection through incon-
Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, sistent solving perspectives.
Xifeng Yan, Jianfeng Gao, and Furu Wei. 2023a.
Augmenting language models with long-term mem- Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang
ory. Chen, Zekun Li, and Linda Ruth Petzold. 2024b.
Alpacare:instruction-tuned large language models for
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc medical application.
Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery,
and Denny Zhou. 2023b. Self-consistency improves Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou,
chain of thought reasoning in language models. Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue
Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Gra-
Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao ham Neubig. 2023. Webarena: A realistic web envi-
Ge, Furu Wei, and Heng Ji. 2024b. Unleashing the ronment for building autonomous agents.
emergent cognitive synergy in large language mod-
els: A task-solving agent through multi-persona self- Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun,
collaboration. and Chao Zhang. 2023. Toolqa: A dataset for llm
question answering with external tools.
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu,
Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang,
Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadal-
lah, Ryen W White, Doug Burger, and Chi Wang.
2023. Autogen: Enabling next-gen llm applications
via multi-agent conversation.
A Model Architecture Comparison with Previous Work

We compare TRIAGEAGENT with representative previous works, as shown in Figure 5. We refer to our system as a heterogeneous multi-agent framework because each agent represents a different role, creating a role-playing heterogeneous structure. This structure simulates real-world collaboration among different roles, with each agent (or role) having its unique function and task, thereby improving the overall system's decision-making efficiency and accuracy. The changing colors of the nodes in the figure illustrate our dynamic heterogeneous nature, representing agents' evolving perspectives based on different rounds of discussion. In the second row, nodes represent agents at different time steps, arrows indicate the edges, and colors signify the roles of the agents.

Node A node represents an agent at a specific time step, each with a unique role in a heterogeneous role-playing structure. This setup simulates real-world collaboration, enhancing decision-making efficiency and accuracy by assigning distinct functions and tasks to each agent. Additionally, the changing node colors in the figure highlight our framework's dynamic nature, illustrating how agents' perspectives evolve as they assimilate and process new information through subsequent discussion rounds.

Edge Edges represent the communication channels between nodes during multi-agent collaboration, illustrating how information flows between agents and through the system. In our LLM-agent-based feed-forward network, these edges show how agents share information across different time steps to generate the final answer for the task query.

B Query Details

The query q used in our experiments is given below:

query q: What is the ESI level of the following clinic record? Please give me a final unique answer after a second revision of your first proposed answer. You can learn from the Emergency Severity Index handbook v4. Double-check the ESI handbook and ask yourself again (two-round self-check) when you are sure about this ESI level classification before you give me the answer. Then classify the following medical record according to ESI level; candidate answers are ESI-1, 2, 3, 4, 5. Here is the record:

C Dataset Construction

The dataset was constructed by extracting relevant patient cases from the ESI handbook v4, focusing on a comprehensive range of clinical scenarios. Each case was carefully reviewed and labeled by professional human experts to ensure accuracy. The dataset is divided into a training set and three test sets (test-1, test-2, and test-3), with the training set containing 180 cases and each test set containing 72 cases. We maintained fixed proportions of ESI levels in the test sets as follows: ESI-1 (12), ESI-2 (20), ESI-3 (13), ESI-4 (12), and ESI-5 (15).

For the training scenarios, we have a total of 218 cases with the following distribution across ESI levels: ESI-1 (14), ESI-2 (92), ESI-3 (65), ESI-4 (22), and ESI-5 (25). The explanations in the training dataset are manually annotated by human experts. Detailed proportions of each ESI level in the training and test sets are provided in Table 1. Each training and test set includes cases with detailed clinical notes and corresponding ESI levels. Although the test sets do not contain explicit explanations for each label, they have been accurately annotated by human experts based on the clinical information provided. The lack of explanations in the test set is due to the scarcity of medical data and the high cost of manual labeling. This highlights the advantage of our framework in reducing human labor costs and improving the efficiency of medical text classification.

C.1 Recruitment and Payment

The human experts assisting us in the classification of medical documents are professionals who voluntarily participated in our comparative study as collaborators. We did not provide them with any additional payments or benefits.

C.2 Instructions Given to Participants

We invited professional human experts to serve as human annotators. The full text of the instructions given to participants is: "This study aims to evaluate the effectiveness of a generative AI model in predicting ESI levels and compare its performance with traditional nurse triage. Your participation will help us understand the potential of AI in augment-
Figure 5: Topology Structure Comparisons of previous methods
Lastly, the Significant Overtriage rate addresses the instances where less critical labels (true label is 2, 3, or 4) are overestimated by the model (predicted as 1):

Significant Overtriage = (Predicted-1 or 5) / (Total number of texts)    (5)

E Performance of Different Backbone LLMs

Using Different Language Models In our experiments, we utilized GPT-3.5-turbo (https://openai.com/index/gpt-3-5-turbo-fine-tuning-and-api-updates/), GPT-4 (https://openai.com/index/gpt-4/), their combination, and LLaMa-2-7B (https://llama.meta.com/llama2/) and LLaMa-3-8B (https://llama.meta.com/llama3/). When GPT-3.5-turbo and GPT-4 engaged in interactive debates on AI-Triage tasks, we observed performance improvements of 31.88% for GPT-3.5-turbo and 24.6% for GPT-4. In a zero-shot scenario, LLaMa-2-7B performed 17.025% worse than LLaMa-3-8B. We also fine-tuned LLaMa-3-8B using training scenarios from the ESI handbook to enhance its domain knowledge. Compared to GPT-4, which achieved a performance improvement of 23.50% on the same training dataset, LLaMa-3-8B showed a performance improvement of 27.78%, making it 4.28% worse than GPT-4. Detailed results for LLaMa-3-8B are presented in Table 5, demonstrating that its performance on clinical triage tasks was inferior to that of the GPT models.

Table 5: Performance of LLaMa-3-8B on our clinical triage task.
Metric                    Zero-shot (CoT)   Fine-Tuned
Total discordance         56.25%            27.78%
Undertriage               3.48%             10.65%
Significant undertriage   3.48%             7.64%
Overtriage                52.78%            17.13%
Significant overtriage    35.42%            6.94%

F Case Study

To better illustrate the issues, we have excerpted parts of the complete conversations as key examples to explain the corresponding types of errors more effectively.

• Type-1: Lack of Document Understanding

Admin (to chat_manager):
What is the ESI level of the following clinic record? Please give me a final unique answer after a second revision of your first proposed answer. The candidate answers: ESI-1, 2, 3, 4, 5. Double-check the clinic notes and related medical knowledge and ask yourself again (two-round self-check) when you are sure about this ESI level classification before you give me the answer. Here is the record: "A healthy 10-year-old male is brought to the emergency department by his mother, who reports that her son has not moved his bowels for a week. He is complaining of 7/10 generalized abdominal pain, nausea, and lack of appetite. Vital signs: BP 107/66, HR 75, RR 20, T 98.6°F, SpO2 99%."

Agent_1 (to chat_manager):
Based on the clinical record provided, the most appropriate Emergency Severity Index (ESI) level for this patient would be ESI level 3. This decision is based on the patient's presenting complaints of abdominal pain, nausea, and lack of appetite, along with the vital signs indicating stable blood pressure, heart rate, respiratory rate, temperature, and oxygen saturation. While the patient's symptoms are concerning and require evaluation, there are no immediate life-threatening issues or unstable vital signs that would warrant a higher ESI level.
Confidence score: 70%.

Agent_2 (to chat_manager):
Based on a thorough review of the clinical record and consideration of the patient's presentation, vital signs, and symptoms, I would revise the ESI level for this patient to ESI level 2. The patient's complaints of abdominal pain, nausea, lack of appetite, and constipation for a week warrant urgent evaluation, but the stable vital signs indicate that there are no immediate life-threatening issues. However, the patient's symptoms require timely assessment and treatment, justifying a higher acuity level than initially considered.
Confidence score: 85%.

Agent_3 (to chat_manager):
Based on the information provided in the clinical record, a more appropriate Emergency Severity Index (ESI) level for this patient would be ESI level 2. The patient is a 10-year-old male with concerning symptoms of abdominal pain, nausea, lack of appetite, and constipation for a week. While the vital signs are stable, the duration of symptoms and the need for urgent evaluation justify a higher acuity level. This patient may require timely intervention and assessment to address the underlying cause of his symptoms, making ESI level 2 more appropriate in this case.
Confidence score: 90%.

Result_summarizer (to chat_manager):
After synthesizing the inputs from Agent_1, Agent_2, and Agent_3, the final unique answer for the ESI level of the patient in the clinical record would be ESI level 2. This decision is based on the patient being a 10-year-old male presenting with concerning symptoms of abdominal pain, nausea, lack of appetite, and constipation for a week. While the vital signs are stable, the duration and nature of the symptoms warrant urgent evaluation and potential intervention, justifying a higher acuity level.
Confidence score: 92%.

• Type-2: Mis-retrieval of Domain Knowledge

This example mis-retrieved the domain knowledge from ESI Handbook v4, Chapter 2: ESI Triage Algorithm, p. 10-13. The example was then erroneously categorized as needing urgent intervention.

User_proxy (to chat_manager):
What is the ESI level of the following clinic record according to the Emergency Severity Index handbook? Please give me a final unique answer after a second revision from your first proposed answer. The candidate answers: ESI-1, 2, 3, 4, 5. Double-check the clinic notes and related medical knowledge and ask yourself again (two-round self-check) when you are sure about this ESI level classification before you give me the answer. Here is the record: "This sounds really strange. A bug flew into my right ear while I was gardening. I tried to get it out by using a Q-tip. I just don't know what else to do, but this buzzing noise is driving me crazy," a 55-year-old female tells you. No previous medical history, and vital signs are within normal limits.

Agent_1 (to chat_manager):
Based on the information provided and the Emergency Severity Index handbook, this patient's clinic note would be categorized as a low ESI level. The patient's condition is not life-threatening but requires medical intervention, which aligns with a lower ESI level.
References: ESI Handbook v4, Chapter 2: ESI Triage Algorithm, p. 10-13.

Agent_2 (to chat_manager):
The scenario describes a 55-year-old female who has a foreign object (a bug) in her ear, causing discomfort and a buzzing noise. She has no previous medical history and her vital signs are within normal limits. This is a non-life-threatening condition, but it does require medical intervention to remove the foreign object and alleviate the patient's discomfort.
The Emergency Severity Index (ESI) is a five-level emergency department triage algorithm that categorizes patients by both acuity and resource needs. ESI level 1 is for patients who need immediate life-saving intervention, while ESI level 5 is for patients with non-urgent conditions.
Given the patient's condition, she would likely be classified as ESI level 4. This level is for patients with conditions that could potentially escalate to a more serious problem requiring emergency intervention if left untreated, but are not life-threatening. The patient's condition is not life-threatening, but it is causing her significant discomfort and could potentially lead to complications such as infection if not addressed.
The most alarming aspect of this issue is the potential for complications if the foreign object is not removed promptly. The use of a Q-tip could potentially push the bug further into the ear canal, causing more damage. It is also noteworthy that the patient's vital signs are within normal limits, indicating that she is not in immediate danger.
After a second revision, I still believe that the ESI level for this patient would be ESI level 4. The patient's condition is not life-threatening, but it does require medical intervention.
Confidence score: 90%.

Agent_3 (to chat_manager):
After reviewing the patient's condition and the Emergency Severity Index handbook, I would classify this patient as ESI level 4 with a confidence score of 90%. The patient's condition is not life-threatening but requires medical intervention to remove the foreign object from her ear and alleviate her discomfort. This aligns with the criteria for ESI level 4, which is for conditions that could potentially escalate to a more serious problem if left untreated but are not life-threatening.

Agent_2 (to chat_manager):
Based on the patient's clinical record and the Emergency Severity Index handbook, I would classify this patient as ESI level 4 with a confidence score of 90%. The patient's condition is not life-threatening but requires medical intervention to remove the foreign object from her ear and alleviate her discomfort. This aligns with the criteria for ESI level 4, which is for conditions that could potentially escalate to a more serious problem if left untreated but are not life-threatening.

Result_summarizer (to chat_manager):
After synthesizing the inputs from Agent_1, Agent_2, and Agent_3, the consensus is that the patient should be classified as ESI level 4. This level is appropriate for patients with conditions that could potentially escalate to a more serious problem if left untreated but are not life-threatening. The patient's condition of having a bug in her ear is causing significant discomfort and requires medical intervention, but it is not life-threatening. Therefore, the final decision is ESI level 4.

• Type-3: Confidence-based Consistency Errors

This error might be caused by the early-stopping mechanism and the confidence-score passing process.

Admin (to chat_manager):
What is the ESI level of the following clinic record? Please give me a final unique answer after a second revision from your first proposed answer. The candidate answers: ESI-1, 2, 3, 4, 5. Double-check the clinic notes and related medical knowledge and ask yourself again (two-round self-check) when you are sure about this ESI level classification before you give me the answer. Here is the record: "I slipped on the ice, and I hurt my wrist," reports a 58-year-old female with a history of migraines. There is no obvious deformity. Vital signs are within normal limits, and she rates her pain as 5/10.

Response 1 by chat_manager (initial response):
Based on the patient's description and clinical information provided, I recommend classifying this patient as ESI level 4. The injury mechanism and the pain level, along with normal vital signs, suggest a non-life-threatening condition.
Confidence score: 90%.

Response 2 by chat_manager (second response after self-check):
Upon further reflection and review of the medical knowledge, I now assess the patient's condition as ESI level 3. Considering the possibility of a hidden injury due to the nature of the fall, a more cautious approach is warranted.
Confidence score: 92%.

• Type-4: Exchange of Information Errors

User_proxy (to chat_manager):
What is the ESI level of the following clinical record according to the Emergency Severity Index handbook? Please provide a final answer after a second revision from your initial proposed answer. The candidate answers are: ESI-1, 2, 3, 4, 5. Double-check the clinic notes and related medical knowledge before finalizing your decision. Here is the record: "A bug flew into my right ear while I was gardening. I tried to get it out by using a Q-tip. I just don't know what else to do, but this buzzing noise is driving me crazy," a 55-year-old female tells you. No previous medical history and vital signs are within normal limits.

Agent_1 (to chat_manager):
Based on the information provided and the Emergency Severity Index handbook, this patient's clinic note would be categorized as a low ESI level. The patient's condition is not life-threatening but requires medical intervention, aligning with a lower ESI level (e.g., ESI level 5), with a confidence score of 80%.
Agent_2 (to chat_manager):
The scenario describes a 55-year-old female
with a foreign object (a bug) in her ear, causing
discomfort and a buzzing noise. She has no
previous medical history, and her vital signs are
within normal limits. This is a non-life-threatening
condition but requires medical intervention to
remove the foreign object and alleviate the
patient’s discomfort.
Given the patient’s condition, she would likely be
classified as ESI level 4. This level is for patients
with conditions that could potentially escalate to
a more serious problem if left untreated but are
not life-threatening. After a second revision, I still
believe that the ESI level for this patient is ESI
level 4, with a confidence score of 90%.