2_emnlp_2024_Triageagent_Towards better multi-agents collaborations for large language model-based clinical triage

The document introduces TRIAGEAGENT, a multi-agent framework designed to improve clinical triage processes by leveraging large language models (LLMs) for enhanced decision-making and collaboration. It addresses challenges such as data processing speed, variability in clinical documents, and the need for precise Emergency Severity Index (ESI) classifications. The framework has shown significant improvements in triage accuracy and efficiency, and it includes the first public benchmark dataset for clinical triage research.

TRIAGEAGENT: Towards Better Multi-Agents Collaborations for Large Language Model-Based Clinical Triage

Meng Lu♠, Brandon Ho♡, Dennis Ren♡, Xuan Wang♠∗

♠Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
♡Children’s National Hospital, Washington DC, USA

(menglu,xuanw)@vt.edu; (bho2,dmren2)@childrensnational.org

∗Corresponding Author

Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5747–5764
November 12-16, 2024 ©2024 Association for Computational Linguistics

Abstract

The global escalation in emergency department patient visits poses significant challenges to efficient clinical management, particularly in clinical triage. Traditionally managed by human professionals, clinical triage is susceptible to substantial variability and high workloads. Although large language models (LLMs) demonstrate promising reasoning and understanding capabilities, directly applying them to clinical triage remains challenging due to the complex and dynamic nature of the clinical triage task. To address these issues, we introduce TRIAGEAGENT, a novel heterogeneous multi-agent framework designed to enhance collaborative decision-making in clinical triage. TRIAGEAGENT leverages LLMs for role-playing, incorporating self-confidence and early-stopping mechanisms in multi-round discussions to improve document reasoning and classification precision for triage tasks. In addition, TRIAGEAGENT employs the medical Emergency Severity Index (ESI) handbook through a retrieval-augmented generation (RAG) approach to provide precise clinical knowledge and integrates both coarse- and fine-grained ESI-level predictions in the decision-making process. Extensive experiments demonstrate that TRIAGEAGENT outperforms state-of-the-art LLM-based methods on three clinical triage test sets. Furthermore, we have released the first public benchmark dataset for clinical triage with corresponding ESI levels and human expert performance for comparison. Our dataset and code can be found at https://github.com/Lucanyc/TriageAgent.

1 Introduction

Emergency Departments (EDs) play a crucial role in the healthcare system by continuously assessing and prioritizing patients based on urgency and severity. This process, known as clinical triage, utilizes the Emergency Severity Index (ESI) as a standardized guide for decisions on rapid medical intervention, which is vital for prioritizing treatment and allocating resources. However, the growing number of patients poses significant challenges to the rapid and precise classification of cases, which is crucial for accurate ESI categorization. Currently, hospitals rely on human experts to review clinical notes and determine case urgency (as illustrated in Figure 1). Although effective, this manual method is time-consuming, labor-intensive, and burdensome for clinical staff. The increasing patient volume and complex triage process often lead to staff fatigue, diminishing accuracy and efficiency, and raising the risk of inconsistent classification or misdiagnosis.

[Figure 1: Illustration of the clinical triage task. Example clinical document: “13-year-old male walks into the ED with his mother on a Friday night. Mom states, ‘I didn’t realize he was out of his medications for his ADHD, and I don’t want him to miss a day.’ The patient is cooperative and pleasant. VS: BP 108/72, HR 78, RR 14, T 98.6°F.” Four panels show example outputs: Human expert (“I classify this patient to level 4.”), Deep Learning Models (“The ESI level of this patient is 4.”), Large Language Models (“Based on the …, the final ESI level answer is 3.”), and Multi-agents framework (“Report: Through discussion... I would classify this patient as an ESI Level 5”, with a 90% confidence score).]

Consequently, there is a high demand for AI methods to automate ESI classification. Traditional deep learning (DL) models (as illustrated in Figure 1) (Kojima et al., 2023; Yao et al., 2021; Sánchez-Salmerón et al., 2022) have assisted in clinical triage but often fall short due to the complex and dynamic nature of the task, which requires extensive labeled data and real-time adaptation. LLMs such as GPT (Kojima et al., 2023; OpenAI et al., 2024), Med-PaLM (Chowdhery et al., 2022), and Llama (Touvron et al., 2023) offer promising solutions with advanced text understanding capabilities, reducing time costs and errors by quickly interpreting and categorizing clinical documents. Additionally, LLMs can leverage external tools, such as knowledge base APIs (Qin et al., 2023; Zhuang et al., 2023), to enhance domain-specific knowledge, adaptability, speed, and accuracy. However, even with these tools, a single LLM may struggle with the complexities of emergency scenarios, such as diverse patient presentations and the need for immediate multi-disciplinary coordination (Chenais et al., 2023). Therefore, a multi-agent approach is necessary to improve performance by distributing tasks and assigning specialized roles to agents with diverse expertise.

Recent advancements have significantly enhanced multi-agent systems in areas such as reasoning (Wang et al., 2023b), sophisticated planning (Yao et al., 2023; Sun et al., 2023), and memory (Wang et al., 2023a). These improvements enable multi-agent LLMs to analyze medical data, formulate treatment plans, and recall patient histories more effectively (Tang et al., 2024). However, directly deploying multi-agent LLMs in clinical triage remains challenging due to their lack of optimization for triage-specific decision-making, resulting in accuracy levels around 60% (as illustrated in Table 2). This suboptimal performance stems from the intricate coordination required among agents and the need to design architectures that better utilize LLMs. Additionally, the lack of open-source benchmark datasets hampers comparisons with human experts, thereby affecting the practical credibility of multi-agent LLMs.

We identify four major challenges hindering clinical triage effectiveness. The first is data processing speed. The increasing patient volume necessitates rapid data processing and decision-making, yet traditional methods often face delays due to continuous data updates and extensive preprocessing, hindering timely clinical interventions. The second is diversity in clinical documents. Variability in patient histories, symptoms, writing styles, and terminologies complicates the classification task. The third is contextual understanding and interpretability. The complexity of clinical contexts makes it challenging for models to accurately interpret information and provide transparent, evidence-based results, leading to a lack of clinician trust. The fourth is boundaries between different ESI levels. Precise differentiation and classification of ESI levels (1-5) are essential to avert critical medical errors. For instance, distinguishing between a level 1 (most urgent) and a level 2 (less urgent) case is vital, as it determines whether a patient receives immediate medical intervention or encounters a prolonged wait. Nevertheless, the subtle nuances between different ESI levels present considerable challenges to the clinical triage process.

To address these challenges, we propose TRIAGEAGENT, a novel heterogeneous multi-agent collaboration framework for clinical triage that leverages LLM-based agents enhanced with external tools and embedded medical knowledge. TRIAGEAGENT enables effective information exchange and reliable interactions among agents, making the framework ideal for zero-shot document classification and handling complex tasks without prior demonstrations. The key innovations include retrieval-augmented generation for context-relevant evidence, a confidence score-based mechanism for precise decision-making, and an early stopping mechanism to improve time efficiency. These features enhance contextual understanding, interpretability, and precision in ESI-level classification. Additionally, the framework supports real-time decision-making and dynamic, interactive debates among agents, refining information for more accurate triage and improving the timeliness of clinical interventions.

Experiments on three ESI clinical triage test sets demonstrate that TRIAGEAGENT significantly improves zero-shot performance with GPT-3.5-turbo and GPT-4, reducing discordance rates by up to 10.84% and 18.42%, respectively. Additionally, we are the first to publicly release a clinical triage dataset that includes clinical notes alongside ESI levels and human expert performance, providing a valuable resource for academic research and clinical practice. By setting new benchmarks, we aim to advance the field of clinical triage in both academic and practical applications. We will publicly release our code and dataset once the paper is published.

2 Related Work

2.1 LLMs Applications in Medical Domain

Large Language Models (LLMs) have recently experienced significant advancements across various fields, particularly in healthcare (Ling et al., 2024; Bi et al., 2024; Nori et al., 2023; Bao et al., 2023). These models are increasingly employed in medical applications, including text-based diagnostics (Ma et al., 2024), genetic analysis (Bi et al., 2024), pharmaceutical applications (Liu et al., 2023), and medical summary generation (Shaib et al., 2023). Current research on LLMs in healthcare primarily focuses on integrating external tools to enhance clinical insights and refining models through instruction tuning. For instance, GeneGPT (Jin et al., 2023) leverages Web APIs from the National Center for Biotechnology Information (NCBI) to access diverse biomedical information and then employs GPT models for reasoning tasks. Additionally, the methods in (Zhang et al., 2024b; Singhal et al., 2022; Oniani et al., 2024; Kang et al., 2023) utilize instruction tuning combined with prompt design to adapt LLMs for various healthcare tasks, including decision support, medical question answering, and disease diagnosis.

2.2 LLM-driven Multi-Agents Collaboration

Research in both academia and industry has focused on autonomous agents trained in isolated, self-contained environments with limited knowledge bases (Wang et al., 2024a). Significant advancements have been achieved in deploying LLM-based agents capable of independently sensing and decision-making, as detailed in (Yao et al., 2023; Xie et al., 2023; Zhou et al., 2023). The trend has shifted towards collaborative multi-agent systems, which enhance the capabilities of LLM agents through iterative feedback and teamwork, as discussed in (Xi et al., 2023; Wang et al., 2024b; Li et al., 2023; Beigi et al., 2024). These systems emulate human learning and decision-making processes, involving agents assuming specific roles (Wang et al., 2024b; Hong et al., 2023) and engaging in effective communication (Qian et al., 2023; Wu et al., 2023; Li et al., 2023). Recent research has also explored improving agent performance through adversarial tactics such as debates (Du et al., 2023; Liang et al., 2023; Xiong et al., 2023) and negotiations (Fu et al., 2023), with innovative frameworks where agents interact competitively (Liang et al., 2023) or negotiate roles as buyers and sellers (Fu et al., 2023).

3 Methodology

This section presents the TRIAGEAGENT framework, which simulates teamwork and problem-solving in diagnosing and planning treatment for ED cases. We explore the heterogeneous structure formulation of the framework, as introduced in Appendix A. We describe the five stages of TRIAGEAGENT’s operation, illustrated in Figure 2 and detailed in the following procedures. First, allocating documents, where patient clinical records are assigned to expert agents to initiate ESI discussions. Second, group-based classification analysis, where agents are divided into two groups to conduct coarse and fine-grained classification. Third, confidence report summarization, where a summarized report is generated based on previous analyses, including classification results and confidence scores. Fourth, collaborative discussion, where agents engage in discussions over the summarized report, iteratively refining it according to key information, rationales, and confidence scores. Fifth, consensus agreement, resulting in an ultimate, precise, and thoroughly validated revised report, highlighting the importance of collaborative decision-making.

[Figure 2: A framework of our proposed heterogeneous multi-agent collaboration method, TRIAGEAGENT, illustrating the reasoning process through five stages when given a clinical document as input. The five stages include 1) allocating documents, 2) group-based classification analysis, 3) confidence report summarization, 4) collaborative discussion, and 5) consensus agreement. Example clinical document: “EMS arrives with a 28-year-old male who was stabbed in the left side of his neck during an altercation. You notice a large hematoma around the wound, and the patient is moaning he can’t breathe. HR 110, RR 36, SpO2 89%.” External tools: ESI handbook, PubMed, Wikipedia.]

3.1 Document Allocation

Given a patient’s narrative clinical note $P = \{P_1, P_2, \ldots\}$, a clinical natural language query $q$, and a structured ESI handbook with level references $R = \{R_1, R_2, \ldots\}$, this stage assigns documents to agents and initiates expert discussions, as illustrated in Figure 2 Step 1. For more details about the query, see Appendix B.

3.2 Group-Based Document Classification

The classification stage is summarized as a function $f: (P, R, q) \to C$, where $C$ represents the set of hierarchically structured ESI-level categories. To improve classification efficiency and accuracy, agents are divided into two groups, employing a direct and a coarse-to-fine-grained classification, respectively.

Coarse-to-Fine-Grained Classification. This group comprises two agents. The first agent, $A_1$, performs an initial coarse classification of the patient record $P_i$ into two broad categories: high (levels 1, 2, and 3) or low (levels 3, 4, and 5), with level 3 included in both. This can be represented as $A_1: P_i \to \{C_{\text{high}}, C_{\text{low}}\}$. The second agent, $A_2$, then refines these broad categories into detailed ESI levels: if classified as $C_{\text{high}}$, the second agent selects from $\{1, 2, 3\}$; if $C_{\text{low}}$, it selects from $\{3, 4, 5\}$. This can be represented as $A_2: \{C_{\text{high}}, C_{\text{low}}\} \to \{\{1, 2, 3\}, \{3, 4, 5\}\}$. This two-step process reduces misclassifications and enhances precision. As illustrated in the left of Step 2 in Figure 2, the first agent, $A_1$ (referred to as Doctor A), initially assigns a high ESI level with 90% confidence. $A_2$ (referred to as Doctor B) then refines the high-level category to ESI level 2 with an 80% confidence score. After rounds of discussion, Doctor B revises the classification to ESI level 1, achieving a revised confidence score of 95%.

Direct Fine-Grained Classification. This group consists of a single agent, $A_3$, who directly assigns ESI levels: $A_3: P_i \to \{1, 2, 3, 4, 5\}$. As illustrated in the right of Step 2 in Figure 2, $A_3$ (referred to as Doctor C) initially assigns ESI level 2 with 90% confidence and later refines the classification to level 1 with 95% confidence after further consideration.

By combining the two strategies described above, TRIAGEAGENT effectively addresses the challenges of distinguishing boundaries between ESI levels and enhances decision-making efficiency and accuracy through this collaborative effort.

3.3 Confidence Report Summarization

In this stage, the summarizer agent $A_s$ summarizes previous document classification results, including confidence scores, rationales, and supporting evidence from $(A_1, A_2, A_3)$. This step consolidates the findings and uses each agent’s analysis report to construct the summary prompt $\text{Prompt}_{rs}$, ensuring a well-supported and reliable decision. Additionally, debates among the agents are incorporated into the summarizer’s prompt. The summarizer then generates a synthesized report by extracting key information and analyzing the previous classifications provided by the agents. This process can be mathematically formulated as $\text{Repo} = \text{LLM}(P, R, r_{rs}, \text{Prompt}_{rs})$, where $\text{Repo}$ represents the synthesized report, $P$ denotes the patient’s clinical notes, $R$ refers to the ESI handbook references, $r_{rs}$ is the role of the summarizer, and $\text{Prompt}_{rs}$ is the guideline prompt for the summarizer, including analysis reports from $(A_1, A_2, A_3)$. The synthesized report is structured as follows: Repo = [key information; confidence score; rationale; consolidated analysis]. This approach effectively combines insights from multiple agents, ensuring that the triage decision is based on comprehensive and validated information. Consequently, this method enhances the accuracy, interpretability, and reliability of the clinical triage process. As depicted in Step 3 of Figure 2, the key information includes references from the ESI Handbook (e.g., ESI Handbook v4, Chapter 2: ESI Triage Algorithm, p. 10-13). The summarized report consolidates rationales and total analysis, ensuring that all relevant information is considered.

3.4 Collaborative Discussion

In this stage, agents engage in multiple rounds of discussions based on the synthesized summary report $\text{Repo}$ to refine their individual classifications. Unlike the commonly-used voting mechanism (Tang et al., 2024), TRIAGEAGENT critically reflects on the classification results, reasoning, and confidence scores, incorporating peer-provided evidence. Each agent $A_i$ starts with an initial classification result $C_i$ and confidence score $S_i$ and then follows this process: $A_i$ reviews the classification results $C_j$, reasoning $R_j$, and confidence scores $S_j$ from every other agent $A_j$ ($j \neq i$). If $A_i$ is persuaded by $A_j$’s reasoning or finds $S_j > S_i$, it updates $C_i$ to $C_j$ with explanations. Conversely, if $A_i$ rejects $A_j$’s reasoning or finds $S_j \le S_i$, it justifies keeping $C_i$. This iterative process continues until agents reach a preliminary consensus or the early stopping mechanism is triggered. In Step 4 of Figure 2, agents participate in collaborative discussions to resolve discrepancies and refine the report.

3.5 Early-stopping Mechanism

To enhance the efficiency of group chat discussions, we implement an early-stopping mechanism inspired by Byzantine Consensus theory (Castro and Liskov, 1999). This approach requires at least $3p + 1$ agents to handle $p$ faulty agents in a single communication round. Additionally, our termination mechanism draws inspiration from advancements in LLMs fine-tuned with Reinforcement Learning from Human Feedback (RLHF), allowing consensus after several debate rounds (Du et al., 2023; Ouyang et al., 2022). The mechanism terminates communication when agents consistently confirm their reasoning with high confidence, thereby reducing unnecessary computations. It operates under two conditions. The first is repetition of high-confidence answers by a single agent: if an agent repeatedly provides the same answer with high confidence, that agent triggers early-stopping and exits the group discussions. The second is repetition of high-confidence answers by multiple agents: if all agents consistently provide the same answer with high confidence, the dialogue is terminated. This dynamic, real-time stopping condition enhances the traditional theory’s efficiency, ensuring efficient and accurate consensus in group discussions. These conditions collectively foster an adaptive termination criterion, prioritizing efficiency and accuracy in reaching conclusions (Yin et al., 2023). In our case, we apply the early-stopping mechanism to each round of discussion of TRIAGEAGENT.

3.6 Consensus Agreement

Finally, agents reach a formal consensus by integrating the refined answers, reasoning, and confidence scores from the collaborative discussion stage. This stage ensures all agents agree on a single classification, leveraging their combined domain knowledge to validate the final decision. This collaborative process ensures the final decision is robust and well-supported by comprehensive analysis. As shown in Step 5 of Figure 2, the final consensus is reached and the definitive classification is provided after all agents agree on the outcome.

4 Experimental Setup

Dataset. We construct a clinical triage dataset by collecting patient cases from the publicly available Emergency Severity Index (ESI) Handbook v4 (esi, Accessed: 2024-04-06). This dataset is designed to evaluate machine learning models and methods for categorizing ESI levels in medical documents. To our knowledge, this is the first publicly released clinical triage dataset that includes clinical notes and corresponding ESI levels necessary for triage tasks, serving as a benchmark for evaluating our framework’s effectiveness. Since the patient cases are sourced from the official ESI Handbook, no de-identification is needed. The dataset is divided into a training set and three test sets (test-1, test-2, and test-3). The training set contains 218 cases with the following distribution across ESI levels: ESI-1 (14), ESI-2 (92), ESI-3 (65), ESI-4 (22), and ESI-5 (25). Each test set contains 72 cases, maintaining fixed proportions of ESI levels: ESI-1 (12), ESI-2 (20), ESI-3 (13), ESI-4 (12), and ESI-5 (15). The dataset statistics are provided in Table 1. For additional details, including copyright information, refer to Appendix C.

Table 1: Statistics of the clinical triage dataset

Dataset      Training   Test-1   Test-2   Test-3
# of Docs    218        72       72       72

Our dataset contribution includes key aspects: We have organized and publicly released clinical notes and ESI levels from the ESI Handbook, thoroughly cleaned, processed, and annotated for use in machine learning. Additionally, we provide a human expert performance baseline as a benchmark for researchers to compare and improve their models. Although the cases come from the official textbook, we invested significant effort in curating and processing the dataset to ensure its high quality and practical use. The ESI labels are drawn directly from the authoritative guidance in the ESI Handbook, ensuring they are based on standardized, expert knowledge rather than manual annotations.
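The split sizes quoted in the Dataset paragraph can be checked with a few lines of arithmetic. The sketch below uses only the per-level counts stated there; the dictionary and function names are illustrative and do not come from the released code.

```python
# Per-level case counts as quoted in the Dataset paragraph.
TRAIN_COUNTS = {1: 14, 2: 92, 3: 65, 4: 22, 5: 25}
# The same distribution is reported for test-1, test-2, and test-3.
TEST_COUNTS = {1: 12, 2: 20, 3: 13, 4: 12, 5: 15}


def level_shares(counts):
    """Return the fraction of cases at each ESI level."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


train_total = sum(TRAIN_COUNTS.values())
test_total = sum(TEST_COUNTS.values())
train_shares = level_shares(TRAIN_COUNTS)
test_shares = level_shares(TEST_COUNTS)

print(train_total, test_total)
for level in sorted(TRAIN_COUNTS):
    print(f"ESI-{level}: train {train_shares[level]:.1%}, test {test_shares[level]:.1%}")
```

The totals agree with Table 1, and the shares show that the training set is dominated by ESI-2 cases while the test sets are more evenly spread across levels.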
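The update rule of Section 3.4 and the two early-stopping conditions of Section 3.5 can be sketched as follows. This is a minimal, LLM-free illustration under stated assumptions, not the authors’ implementation: persuasion by reasoning is left out, so an agent switches only when a peer reports strictly higher confidence; an agent that switches also adopts the peer’s confidence; and the threshold `HIGH_CONFIDENCE`, the repeat count `REPEATS_TO_EXIT`, and all class and function names are hypothetical. The default of 12 rounds mirrors the iteration cap quoted in the Implementation paragraph.

```python
from dataclasses import dataclass

HIGH_CONFIDENCE = 0.9  # hypothetical threshold; the paper states no value
REPEATS_TO_EXIT = 2    # hypothetical: unchanged high-confidence rounds before an agent exits


@dataclass
class Agent:
    name: str
    level: int          # current ESI level C_i (1-5)
    confidence: float   # current confidence S_i in [0, 1]
    streak: int = 0     # consecutive rounds with an unchanged high-confidence answer
    active: bool = True


def discussion_round(agents):
    """One round: every active agent compares itself with its most confident peer."""
    snapshot = [(a.level, a.confidence) for a in agents]  # simultaneous update
    for i, agent in enumerate(agents):
        if not agent.active:
            continue
        peers = [s for j, s in enumerate(snapshot) if j != i]
        peer_level, peer_conf = max(peers, key=lambda s: s[1])
        previous = agent.level
        if peer_conf > agent.confidence:  # S_j > S_i: adopt C_j
            agent.level, agent.confidence = peer_level, peer_conf
        # otherwise S_j <= S_i: keep C_i
        unchanged_high = agent.level == previous and agent.confidence >= HIGH_CONFIDENCE
        agent.streak = agent.streak + 1 if unchanged_high else 0
        if agent.streak >= REPEATS_TO_EXIT:  # condition 1: this agent exits
            agent.active = False


def discuss(agents, max_rounds=12):
    """Run rounds until consensus, until every agent has exited, or until max_rounds."""
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        discussion_round(agents)
        agreed = len({a.level for a in agents}) == 1
        confident = all(a.confidence >= HIGH_CONFIDENCE for a in agents)
        if agreed and confident:  # condition 2: terminate the dialogue
            return agents[0].level, rounds_used
        if not any(a.active for a in agents):
            break
    return None, rounds_used


# Illustrative starting positions (not taken from the paper's experiments).
team = [Agent("B", 2, 0.80), Agent("C", 1, 0.95), Agent("D", 2, 0.85)]
consensus = discuss(team)

# Two equally confident agents that disagree never converge and both exit.
stalemate = discuss([Agent("X", 2, 0.95), Agent("Y", 1, 0.95)])
```

At least two agents are required, since each agent needs a peer to compare against. In the first example the two less confident agents adopt the most confident answer in a single round; in the second, neither agent finds a strictly more confident peer, so both leave the discussion without a consensus.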
Table 2: Performance comparison of TRIAGEAGENT with baseline methods on the clinical triage dataset. The reported performance (%) in this table are averaged values from the three test sets in the dataset (Table 1). The highest performance is highlighted in bold. A lower total discordance represents a higher performance. CoT denotes chain-of-thought prompting, SCtr denotes self-contrast prompting, SCons denotes self-consistency prompting, and EoT denotes the exchange-of-thought prompting method.

Supervision  Method                                   Total Discordance  UnderTriage  Significant UnderTriage  OverTriage  Significant OverTriage

GPT-3.5
Supervised   Vanilla                                  39.18              21.76        15.28                    18.06       6.85
Zero-shot    MedAgent (w/Handbook)                    39.58              5.56         5.56                     34.03       15.97
Zero-shot    CoT (1-Agnt)                             41.40              16.70        12.50                    24.70       8.33
Zero-shot    SCtr (1-Agnt)                            39.35              11.57        8.33                     27.78       11.58
Zero-shot    SCons (1-Agnt)                           36.11              17.59        8.33                     15.74       7.87
Zero-shot    EoT (4-Agnt)                             36.81              11.57        6.94                     15.28       9.72
Zero-shot    SCons (4-Agnt)                           34.72              5.56         4.63                     27.31       10.65
Zero-shot    SCons (4-Agnt) (w/Handbook)              31.02              7.41         7.41                     23.61       6.94
Zero-shot    SCons (4-Agnt)+Confidence (w/Handbook)   32.87              6.02         6.02                     26.85       10.65
Zero-shot    TRIAGEAGENT (Vanilla)                    34.72              5.56         4.63                     27.31       10.56
Zero-shot    TRIAGEAGENT (w/Handbook)                 31.02              7.87         7.41                     22.69       5.56
Zero-shot    TRIAGEAGENT (Ours)                       30.56              6.94         6.48                     24.54       9.72

GPT-4
Supervised   Vanilla                                  23.50              8.10         6.94                     14.80       8.33
Supervised   Vanilla (w/Handbook)                     22.68              9.50         5.70                     7.10        1.90
Zero-shot    MedAgent (w/Handbook)                    30.56              4.17         3.24                     25.93       18.52
Zero-shot    CoT (1-Agnt)                             37.40              14.30        8.33                     23.30       10.64
Zero-shot    EoT (4-Agnt)                             29.86              9.03         5.56                     20.83       12.50
Zero-shot    SCons (4-Agnt)                           29.63              11.11        7.87                     18.06       8.33
Zero-shot    SCons (4-Agnt) (w/Handbook)              23.61              5.09         3.70                     18.52       9.26
Zero-shot    SCons (4-Agnt)+Confidence (w/Handbook)   23.61              5.09         3.70                     18.52       9.26
Zero-shot    TRIAGEAGENT (Vanilla)                    29.63              11.11        7.87                     18.06       8.33
Zero-shot    TRIAGEAGENT (w/Handbook)                 23.61              5.09         3.70                     18.52       9.26
Zero-shot    TRIAGEAGENT (Ours)                       18.98              2.30         2.80                     17.10       8.80

Human Eval   Human Experts                            31.43              12.80        8.61                     18.60       10.50
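The first, second, and fourth numeric columns of Table 2 can be computed from reference and predicted ESI levels as sketched below. The sketch assumes the usual triage convention that undertriage means a predicted level less urgent than the reference (a numerically higher ESI level) and overtriage means the reverse; the paper’s exact definitions are in Appendix D. The "significant" variants are left out because their definition is not given in the text shown here, and the function name is illustrative.

```python
def triage_rates(gold, pred):
    """Return total discordance, undertriage, and overtriage as percentages.

    gold, pred: equal-length sequences of ESI levels (1 = most urgent, 5 = least urgent).
    Assumption: pred > gold counts as undertriage, pred < gold as overtriage.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    under = sum(p > g for g, p in zip(gold, pred))
    over = sum(p < g for g, p in zip(gold, pred))
    return {
        "total_discordance": 100 * (under + over) / n,
        "undertriage": 100 * under / n,
        "overtriage": 100 * over / n,
    }


# Toy example with made-up labels (not from the dataset).
gold = [1, 2, 3, 4, 5, 2, 3, 1]
pred = [1, 3, 3, 2, 5, 2, 4, 2]
rates = triage_rates(gold, pred)
print(rates)
```

Under this convention, undertriage and overtriage add up to the total discordance, which matches rows of Table 2 such as CoT (1-Agnt) with GPT-3.5 (16.70 + 24.70 = 41.40).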

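The discordance reductions quoted in the text appear to be absolute differences (percentage points) between rows of Table 2. The check below uses only Total Discordance values copied from the table and Python’s `decimal` module for exact arithmetic; the dictionary and function names are illustrative.

```python
from decimal import Decimal

# Total Discordance (%) copied from Table 2.
GPT35 = {
    "CoT (1-Agnt)": "41.40",
    "SCtr (1-Agnt)": "39.35",
    "SCons (1-Agnt)": "36.11",
    "MedAgent (w/Handbook)": "39.58",
    "EoT (4-Agnt)": "36.81",
    "TRIAGEAGENT (Ours)": "30.56",
}
GPT4 = {
    "CoT (1-Agnt)": "37.40",
    "TRIAGEAGENT (Ours)": "18.98",
}


def gains(table, ours="TRIAGEAGENT (Ours)"):
    """Percentage-point reduction in total discordance relative to each baseline."""
    best = Decimal(table[ours])
    return {method: Decimal(value) - best for method, value in table.items() if method != ours}


gpt35_gains = gains(GPT35)
gpt4_gains = gains(GPT4)
for method, gain in gpt35_gains.items():
    print(f"GPT-3.5 vs {method}: {gain} points")
print(f"GPT-4 vs CoT (1-Agnt): {gpt4_gains['CoT (1-Agnt)']} points")
```

The largest gaps are against the single-agent CoT baseline for both backbones, which is where the "up to 10.84% and 18.42%" figures come from.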
Implementation. We use GPT-3.5-Turbo (OpenAI, 2024) and GPT-4 (OpenAI et al., 2024) from OpenAI as our base models for the zero-shot experiments. Our TRIAGEAGENT framework utilizes the publicly open-sourced Autogen framework from Microsoft¹. The temperature is 0.9, topk is 1.0, and the cache seed is 42. The maximum number of iterations is 12, and the frequency penalty is 0.1. For SCtr and SCons, we perform 8 iterations with a temperature of 0.9.

¹ https://microsoft.github.io/autogen/

Baselines. The performance of the TRIAGEAGENT framework is evaluated against several state-of-the-art baselines, including methods that employ LLM-based planning, tool usage, and retrieval-augmented generation.

• Chain-of-thought (Kojima et al., 2023) integrates step-by-step reasoning into the prompt for LLMs. We implemented CoT on our clinical triage dataset as a baseline method for comparison.
• Self-contrast (Zhang et al., 2024a) improves stability and accuracy by contrasting different solving perspectives and summarizing discrepancies. We applied self-contrast on our clinical triage dataset to analyze and reconcile conflicting classification results as a baseline for comparison.
• Self-consistency (Wang et al., 2023b) enhances zero-shot and few-shot CoT by generating predominant responses through multiple chain samplings. We implemented self-consistency on our clinical triage dataset to generate multiple response chains and select the most consistent answers as a baseline for comparison.
• Exchange-of-thought (Yin et al., 2023) enables cross-model communication and problem-solving integration. We implemented EoT on our clinical triage dataset to facilitate communication between various agents as a baseline for comparison.
• MedAgent (Tang et al., 2024) is a role-playing collaboration framework for medical scenarios
using LLMs. We utilized MedAgent’s multi- by 9.02% and EoT by 6.25%. Table 2 outlines
agent framework on our clinical triage dataset the three variations of our TRIAGEAGENT frame-
as a baseline for comparison. work.
• T RIAGE AGENT utilizes dynamically updated
confidence scores from various reasoning per- Comparison with CoT Methods Performance
spectives supported by external evidence to en- can decline when employing overly complex Chain
hance the performance. It selects the top K most of Thought (CoT) methods. Simply stacking
confident answers for critical assessment. Agents prompts without a clear, logical sequence can re-
employ retrieval-augmented generation to refine sult in hallucinations—erroneous outputs caused
choices, ensuring minimal discordance and max- by insufficient document comprehension and mis-
imal coherence. This iterative process integrates understanding of medical terminologies. However,
multiple methodologies and specialized knowl- our approach, which integrates multi-agent role-
edge retrieval to improve decision accuracy and playing with confidence assessments, effectively
reliability. addresses these issues and demonstrates its poten-
tial as a more robust method for applying LLMs in
Evaluation Protocol Our primary evaluation metric is the total discordance rate, which measures the percentage of incorrectly predicted queries. This metric is critical because it provides a comprehensive overview of accuracy; a lower value indicates better performance. Additionally, we evaluate the undertriage rate, overtriage rate, significant undertriage rate, and significant overtriage rate to assess the model's performance in specific areas of clinical urgency categorization. While these metrics are important, the total discordance rate remains the primary measure of accuracy. Detailed definitions of the five ESI levels (I-V) and the evaluation metrics are provided in Appendix D.

5 Results

5.1 Main Results

We evaluate the performance of TRIAGEAGENT by averaging the results across the three test sets. The backbone LLMs in our experiments include GPT-3.5, GPT-4, Llama-2, and Llama-3. Table 2 presents the main results for the better-performing GPT-3.5 and GPT-4 models. Detailed results for Llama-2-7B and Llama-3-8B are provided in Appendix E.

Performance Comparison with Baselines The performance of our framework compared to state-of-the-art (SOTA) methods is presented in Table 2. The TRIAGEAGENT framework surpasses traditional prompt engineering methods, including supervised learning, self-contrast, and self-consistency, with improvements of 9.25%, 8.79%, and 5.55%, respectively. It also outperforms SOTA multi-agent frameworks in a zero-shot setting using the GPT-3.5-turbo model, exceeding MedAgents in clinical triage.

Comparison with Single-Agent Methods Methods such as CoT, self-contrast, and supervised learning lack crucial interactions among multiple LLMs. This absence prevents these methods from adaptively refining their responses, leading to suboptimal performance in triage question-answering scenarios. Consequently, their discordance rates average around 38.95% on our clinical triage dataset when using GPT-3.5 (as illustrated in Table 2), highlighting the need for improvement.

5.2 Ablation Study

Our ablation study analyzes team optimization and external resource optimization. After determining the optimal structure, the TRIAGEAGENT framework simulates multi-role team collaboration, enabling agents to acquire the necessary capabilities to effectively accomplish triage tasks.

Team Optimization Our heterogeneous framework employs multiple agents, each with a specific role, to achieve optimal outcomes. We explored various configurations to enhance team performance. By adjusting the number of role-specific agents, we found that four agents provided the most optimized structure for performance on the ESI triage dataset. Figure 3(a) details this optimal configuration and demonstrates how adjusting agent roles and numbers enhances overall system performance.

External Resource Optimization We evaluate the ESI Handbook, PubMed, and Wikipedia for supervising our model (Figure 3(c)). The ESI Handbook proved to be the most effective, significantly enhancing model performance with its targeted clinical diagnostic guidelines. In contrast, PubMed
Figure 3: Ablation study results. (a), (b) and (c) show the impact of agent structure composition, optimizing the agent team with
the ESI handbook and optimizing the agent team with various external resources, respectively. The lower Total Discordance
value in this figure represents better performance.

and Wikipedia require processing extensive additional data. We also analyze the optimal number of agents using the ESI Handbook to maximize task efficiency (Figure 3(b)). Increasing the number of agents improved model effectiveness without significantly changing overall accuracy, enhancing the contextual understanding and interpretability of medical texts while optimizing resource utilization.

5.3 Case Study

Error Analysis Based on our findings, we conduct an expert evaluation to identify key limitations and common issues in our model. As shown in Figure 4, we categorize these errors into four major types. The first type is lack of document understanding. This type of error arises from insufficient medical knowledge or incorrect linking to ESI levels, leading to misjudgments of clinical severity. The second type is mis-retrieval of domain knowledge. These errors result from retrieving irrelevant or mismatched information, which compromises triage accuracy. The third type is confidence-based consistency errors. This type of error is caused by confidently providing contradictory responses or failing to reach a consensus, often due to internal inefficiencies or flaws in the early stopping mechanism, resulting in falsely assured incorrect outcomes. The last type is exchange of information errors. This type of error results from incorrect data transfer between agents, disrupting logical sequences and leading to erroneous conclusions. See Appendix F for more details.

Time Analysis Time efficiency is crucial in emergency clinical triage. Table 3 compares the performance and time costs of different methods. Our multi-agent architecture incurs a slight, negligible increase in time costs compared to a single-agent model. Feedback from emergency department experts indicates that our method does not significantly affect overall time efficiency but greatly enhances decision accuracy, which is crucial for better patient outcomes and resource allocation.

Table 3: Time efficiency on multiple agents (average seconds per test case)
Model                         Test-1   Test-2   Test-3
CoT (GPT-3.5)                 0.17     0.19     0.17
CoT (GPT-4)                   0.16     0.19     0.17
Self-Consistency (GPT-3.5)    0.21     0.23     0.24
Three-agents (GPT-3.5)        0.55     1.01     0.58
Three-agents (GPT-4)          0.57     1.00     0.59
Four-agents (GPT-3.5)         1.28     1.36     1.41
Four-agents (GPT-4)           1.31     1.56     1.53
Five-agents (GPT-3.5)         1.55     1.56     1.49
Six-agents (GPT-3.5)          2.11     2.05     2.13
TriageAgent (GPT-3.5)         1.30     1.45     1.52
TriageAgent (GPT-4)           1.31     1.43     1.50

Cost Analysis Operational efficiency, particularly API token usage, is crucial for our framework. Table 4 compares the performance and costs of different configurations. While TRIAGEAGENT requires more API calls than a single-agent setup, feedback from clinical departments suggests these costs are reasonable. Our system improves efficiency by 12.54% compared to human experts manually classifying documents and achieves 18.54% higher performance than LLMs using CoT prompts operated by human experts.

Table 4: Cost efficiency on ESI datasets classification
Model                     Performance   #API Calls
Two-agent (GPT-3.5)       38.42%        324
Three-agent (GPT-3.5)     38.42%        486
Four-agents (GPT-3.5)     30.56%        648
Four-agents (GPT-4)       18.98%        604
CoT (GPT-4)               37.40%        216
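To make the cost comparison in Table 4 concrete, the following minimal Python sketch expresses each configuration's API-call count relative to the single-agent row. The numbers are copied from Table 4; the names `TABLE_4` and `relative_api_cost`, and the choice of CoT (GPT-4) as the baseline row, are illustrative assumptions and not part of the paper's code.

```python
from typing import Dict, Tuple

# Values copied from Table 4: (performance in percent, number of API calls).
TABLE_4: Dict[str, Tuple[float, int]] = {
    "Two-agent (GPT-3.5)": (38.42, 324),
    "Three-agent (GPT-3.5)": (38.42, 486),
    "Four-agents (GPT-3.5)": (30.56, 648),
    "Four-agents (GPT-4)": (18.98, 604),
    "CoT (GPT-4)": (37.40, 216),
}


def relative_api_cost(
    table: Dict[str, Tuple[float, int]],
    baseline: str = "CoT (GPT-4)",  # assumed baseline: the only single-agent row
) -> Dict[str, float]:
    """Return each configuration's API calls divided by the baseline's calls."""
    base_calls = table[baseline][1]
    return {name: calls / base_calls for name, (_, calls) in table.items()}


ratios = relative_api_cost(TABLE_4)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the API calls of the baseline")
```

Under this reading, the four-agent GPT-3.5 configuration issues three times as many calls as the CoT baseline, which is the overhead the cost analysis weighs against the performance figures in the same table.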
Figure 4: Ratio of different categories of error cases. Legend: Document Understanding Errors; Mis-retrieval of Domain Knowledge; Confidence-based Consistency Errors; Exchange of Information Errors.

6 Conclusions

This paper introduces a novel heterogeneous multi-agent framework, TRIAGEAGENT, for clinical triage, utilizing LLM-based role-playing agents in a multi-stage group chat setting. This zero-shot, training-free, and interpretable framework comprises five significant stages. Experiments on clinical triage datasets demonstrate that our framework significantly outperforms zero-shot baselines and experienced professionals. Case studies and human evaluations highlight areas for improvement, such as reducing document understanding errors and knowledge misretrieval. Future research can enhance the framework's efficiency by improving document comprehension and correcting domain knowledge retrieval errors. Upon acceptance, we will release our dataset as open source.

Acknowledgement

Our work is sponsored by the NSF NAIRR Pilot and PSC Neocortex, Commonwealth Cyber Initiative, Children's National Hospital, Fralin Biomedical Research Institute (Virginia Tech), Sanghani Center for AI and Data Analytics (Virginia Tech), Virginia Tech Innovation Campus, and a generous gift from the Amazon + Virginia Tech Center for Efficient and Robust Machine Learning.

Ethics Statement

This research adhered to the highest ethical standards and best practices, as outlined in the ACL Code of Ethics. All experiments were conducted using publicly available datasets, thereby avoiding concerns related to privacy, confidentiality, or personal information. The datasets used are fully anonymized and have been vetted to ensure compliance with ethical guidelines. Additionally, we have carefully considered the broader impacts and potential applications of our work, ensuring that it does not inadvertently cause harm or misuse. Consequently, we believe this research is free from ethical issues.

Limitations

In this paper, we introduce a heterogeneous multi-agent collaboration framework called TRIAGEAGENT. Despite our efforts, the framework faces limitations inherent to the healthcare industry.

Limited Expert Evaluation Our research is limited by the involvement of three human experts, restricting the scope and depth of expert evaluation, which may impact the generalizability of our findings to broader clinical settings.

Cross-Institution Collaboration The complexity of emergency departments requires advanced triage systems to address patient conditions within the same urgency levels. This need arises from varying institutional conditions, protocols, and patient demographics.

Workflow Integration Our model provides a final triage decision, but actual triage often involves multiple decision-making stages. Thus, it is crucial to evaluate how well our system integrates into existing workflows and complements human-led emergency care.

Privacy Deploying our framework necessitates strict privacy measures and clinical worker training. Processing clinical notes can expose sensitive information, making compliance with HIPAA and GDPR essential.

Time and Cost Efficiency TRIAGEAGENT aims to enhance clinical triage by automating initial patient assessment, reducing manual workload, waiting times, burnout, and operational costs. However, using external technologies like OpenAI's API adds expenses and dependencies. A detailed cost-benefit analysis is essential for sustainability and economic viability.

A Model Architecture Comparison with Previous Work

We compare TRIAGEAGENT with representative previous works, as shown in Figure 5. We refer to our system as a heterogeneous multi-agent framework because each agent represents a different role, creating a role-playing heterogeneous structure. This structure simulates real-world collaboration among different roles, with each agent (or role) having its unique function and task, thereby improving the overall system's decision-making efficiency and accuracy. The changing colors of nodes in the figure illustrate our dynamic heterogeneous nature, representing agents' evolving perspectives based on different rounds of discussions. In the second row, nodes represent agents at different time steps, arrows indicate the edges, and colors signify the roles of the agents.

Node A node represents an agent at a specific time step, each with a unique role in a heterogeneous role-playing structure. This setup simulates real-world collaboration, enhancing decision-making efficiency and accuracy by assigning distinct functions and tasks to each agent. Additionally, the changing node colors in the figure highlight our framework's dynamic nature, illustrating how agents' perspectives evolve as they assimilate and process new information through subsequent discussion rounds.

Edge Edges represent the communication channels between nodes during multi-agent collaboration, illustrating how information flows between agents and through the system. In our LLM-agent-based feed-forward network, these edges show how agents share information across different time steps to generate the final answer for the task query.

Figure 5: Topology Structure Comparisons of previous methods

B Query Details

The query is given below:
query q: What is the ESI level of the following clinic record? Please give me a final unique answer after a second revision of your first proposed answer. You can learn from the Emergency Severity Index handbook v4. Double-check the ESI handbook and ask yourself again (two-round self-check) when you are sure about this ESI level classification before you give me the answer. Then classify the following medical record according to ESI level, candidate answers are ESI-1,2,3,4,5. Here is the record:

C Dataset Construction

The dataset was constructed by extracting relevant patient cases from the ESI handbook v4, focusing on a comprehensive range of clinical scenarios. Each case was carefully reviewed and labeled by professional human experts to ensure accuracy. The dataset is divided into a training set and three test sets (test-1, test-2, and test-3), with the training set containing 180 cases and each test set containing 72 cases. We maintained fixed proportions of ESI levels in the test sets as follows: ESI-1 (12), ESI-2 (20), ESI-3 (13), ESI-4 (12), and ESI-5 (15).

For the training scenarios, we have a total of 218 cases with the following distribution across ESI levels: ESI-1 (14), ESI-2 (92), ESI-3 (65), ESI-4 (22), and ESI-5 (25). The explanations in the training dataset are manually annotated by human experts. Detailed proportions of each ESI level in the training and test sets are provided in Table 1.

Each training and test set includes cases with detailed clinical notes and corresponding ESI levels. Although the test sets do not contain explicit explanations for each label, they have been accurately annotated by human experts based on the clinical information provided. The lack of explanations in the test set is due to the scarcity of medical data and the high cost of manual labeling. This highlights the advantage of our framework in reducing human labor costs and improving the efficiency of medical text classification.

C.1 Recruitment and Payment

The human experts assisting us in the classification of medical documents are professional experts who voluntarily participated in our comparative study as collaborators. We did not provide them with any additional payments or benefits.

C.2 Instructions Given to Participants

We invited professional human experts to serve as human annotators. The full text of the instructions given to participants is: "This study aims to evaluate the effectiveness of a generative AI model in predicting ESI levels and compare its performance with traditional nurse triage. Your participation will help us understand the potential of AI in augmenting emergency department workflows and improving patient care. All test results are de-identified and will only be used for the purposes of this research study. Please do not look up answers or use any additional resources to complete the test as that can negatively impact the validity of this study."

C.3 Dataset Copyright

We have confirmed that the data comes from the publicly available ESI Handbook and complies with fair use policies.

D ESI Hierarchy and Evaluation Metrics

D.1 Hierarchy of ESI levels

The ESI (Emergency Severity Index) levels classify medical events by urgency, from 1 (most urgent) to 5 (least urgent). Drawing on the ESI handbook and their own medical experience, medical personnel can quickly categorize the current emergency medical event according to the ESI classification.
• ESI-1: Most urgent; immediate life-saving intervention.
• ESI-2: High urgency; potentially life-threatening, prompt attention necessary.
• ESI-3: Urgent; requires multiple resources but not immediately life-threatening.
• ESI-4: Less urgent; requires one resource; not immediately life-threatening.
• ESI-5: Least urgent; no resources needed immediately; wait time is acceptable.

D.2 Evaluation Metrics

The Total discordance is calculated as the ratio of the total number of misclassified texts to the total number of texts, representing the overall error rate of the model. This metric is given by the formula:

\[
\text{Total discordance} = \frac{\text{Total misclassifications}}{\text{Total number of texts}} \tag{1}
\]

where Total misclassifications is the number of queries incorrectly classified by the model, and Total number of texts is the total number of queries analyzed. The Undertriage rate is defined as the fraction of instances where the predicted label is greater than the true label, which is calculated as:

\[
\text{Undertriage} = \frac{\text{Number of predictions} > \text{True\_labels}}{\text{Total number of texts}} \tag{2}
\]

Similarly, the Overtriage rate is defined as the fraction of instances where the predicted label is less than the true label:

\[
\text{Overtriage} = \frac{\text{Number of predictions} < \text{True\_labels}}{\text{Total number of texts}} \tag{3}
\]

The Significant Undertriage rate captures the scenarios where the true label is critical (ESI 1 or 2) but the prediction underestimates the urgency (predicted as 3, 4, or 5):

\[
\text{Significant Undertriage} = \frac{\text{Number of cases with true ESI 1 or 2 and predicted ESI 3, 4, or 5}}{\text{Total number of texts}} \tag{4}
\]
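Equations (1) to (4) can be sketched in a few lines of Python. This is a minimal illustration, assuming ESI levels are encoded as integers from 1 (most urgent) to 5 (least urgent); the function name `triage_metrics` and the example labels are hypothetical and do not come from the paper's code or dataset.

```python
from typing import Dict, Sequence


def triage_metrics(
    true_labels: Sequence[int], predicted_labels: Sequence[int]
) -> Dict[str, float]:
    """Compute the rates of Equations (1)-(4) as fractions of all texts."""
    if len(true_labels) != len(predicted_labels) or len(true_labels) == 0:
        raise ValueError("label sequences must be non-empty and equally long")

    pairs = list(zip(true_labels, predicted_labels))
    total = len(pairs)

    # Eq. (1): any mismatch between prediction and true label.
    misclassified = sum(1 for t, p in pairs if p != t)
    # Eq. (2): predicted ESI number is greater (less urgent) than the true one.
    undertriaged = sum(1 for t, p in pairs if p > t)
    # Eq. (3): predicted ESI number is smaller (more urgent) than the true one.
    overtriaged = sum(1 for t, p in pairs if p < t)
    # Eq. (4): true ESI 1 or 2 but predicted ESI 3, 4, or 5.
    significant_under = sum(1 for t, p in pairs if t in (1, 2) and p in (3, 4, 5))

    return {
        "total_discordance": misclassified / total,
        "undertriage": undertriaged / total,
        "overtriage": overtriaged / total,
        "significant_undertriage": significant_under / total,
    }


# Hypothetical labels for six records, for illustration only.
example_true = [1, 2, 3, 4, 5, 2]
example_pred = [1, 3, 3, 2, 5, 2]
metrics = triage_metrics(example_true, example_pred)
print(metrics)
```

Because every misclassification is either an undertriage or an overtriage, the undertriage and overtriage rates sum to the total discordance, which gives a quick sanity check on any reported results.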

Lastly, the Significant Overtriage rate addresses the instances where less critical labels (true label is 2, 3, or 4) are overestimated by the model (predicted as 1):

$$\text{Significant Overtriage} = \frac{\text{Number of texts with true label 2, 3, or 4 predicted as 1}}{\text{Total number of texts}} \tag{5}$$

E Performance of Different Backbone LLMs

Using Different Language Models. In our experiments, we utilized GPT-3.5-turbo [2], GPT-4 [3], their combination, and LLaMa-2-7B [4] and Llama-3-8B [5]. When GPT-3.5-turbo and GPT-4 engaged in interactive debates on AI-Triage tasks, we observed performance improvements of 31.88% for GPT-3.5-turbo and 24.6% for GPT-4. In a zero-shot scenario, LLaMa-2-7B performed 17.025% worse than LLaMa-3-8B. We also fine-tuned LlaMa-3-8B using training scenarios from the ESI handbook to enhance its domain knowledge. Compared to GPT-4, which achieved a performance improvement of 23.50% on the same training dataset, LLaMa-3-8B showed a performance improvement of 27.78%, making it 4.28% worse than GPT-4. Detailed results for LLaMa-3 are presented in Table 5, demonstrating that its performance on clinical triage tasks was inferior to that of the GPT models. For details about the performance of LlaMa-3 on our task, see Table 5.

| Method | Zero-shot (CoT) | Fine-Tuned |
| Total discordance | 56.25% | 27.78% |
| Undertriage | 3.48% | 10.65% |
| Significant undertriage | 3.48% | 7.64% |
| Overtriage | 52.78% | 17.13% |
| Significant overtriage | 35.42% | 6.94% |
Table 5: Performance of LlaMa-3-8B on our clinical triage task.

[2] https://openai.com/index/gpt-3-5-turbo-fine-tuning-and-api-updates/
[3] https://openai.com/index/gpt-4/
[4] https://llama.meta.com/llama2/
[5] https://llama.meta.com/llama3/

F Case Study

To better illustrate the issue, we have excerpted parts of the complete conversation as key examples to more effectively explain the corresponding types of errors.

• Type-1: Lack of Document Understanding

Admin (to chat_manager):
What is the ESI level of the following clinic record? Please give me a final unique answer after a second revision of your first proposed answer. The Candidate answers: ESI-1,2,3,4,5. Double-check the clinic notes and related medical knowledge and ask yourself again (two-round self-check) when you are sure about this ESL level classification before you give me the answer. Here is the record: “A healthy 10-year-old male is brought to the emergency department by his mother, who reports that her son has not moved his bowels for a week. He is complaining of 7/10 generalized abdominal pain, nausea, and lack of appetite. Vital signs: BP 107/66, HR 75, RR 20, T 98.6°F, SpO2 99%”.

Agent_1 (to chat_manager):
Based on the clinical record provided, the most appropriate Emergency Severity Index (ESI) level for this patient would be ESI level 3. This decision is based on the patient’s presenting complaints of abdominal pain, nausea, and lack of appetite, along with the vital signs indicating stable blood pressure, heart rate, respiratory rate, temperature, and oxygen saturation. While the patient’s symptoms are concerning and require evaluation, there are no immediate life-threatening issues or unstable vital signs that would warrant a higher ESI level.
Confidence score: 70%.

Agent_2 (to chat_manager):
Based on a thorough review of the clinical record and consideration of the patient’s presentation, vital signs, and symptoms, I would revise the ESI level for this patient to ESI level 2. The patient’s complaints of abdominal pain, nausea, lack of appetite, and constipation for a week warrant urgent evaluation, but the stable vital signs indicate that there are no immediate life-threatening issues. However, the patient’s symptoms require timely assessment and treatment, justifying a higher acuity level than initially considered.
Confidence score: 85%.

Agent_3 (to chat_manager):
Based on the information provided in the clinical record, a more appropriate Emergency Severity Index (ESI) level for this patient would be ESI level 2. The patient is a 10-year-old male with concerning symptoms of abdominal pain, nausea, lack of appetite, and constipation for a week. While the vital signs are stable, the duration of symptoms and the need for urgent evaluation justify a higher acuity level. This patient may require timely intervention and assessment to address the underlying cause of his symptoms, making ESI level 2 more appropriate in this case.
Confidence score: 90%.

Result_summarizer (to chat_manager):
After synthesizing the inputs from the Agent_1, Agent_2, and Agent_3, the final unique answer for the ESI level of the patient in the clinical record would be ESI level 2. This decision is based on the patient being a 10-year-old male presenting with concerning symptoms of abdominal pain, nausea, lack of appetite, and constipation for a week. While the vital signs are stable, the duration and nature of the symptoms warrant urgent evaluation and potential intervention, justifying a higher acuity level.
Confidence score: 92%.

• Type-2: Mis-retrieval of Domain Knowledge

This example mis-retrieved the domain knowledge from ESI Handbook v4, Chapter 2: ESI Triage Algorithm, p. 10-13. Then this example was erroneously categorized as needing urgent intervention.

User_proxy (to chat_manager):
What is the ESI level of the following clinic record according to the Emergency Severity Index handbook? Please give me a final unique answer after a second revision from your first proposed answer. The Candidate answers: ESI-1,2,3,4,5. Double Check the clinic notes and related medical knowledge and ask yourself again (two-round self-check) when you are sure about this ESL level classification before you give me the answer. Here is the record: “This sounds really strange. A bug flew into my right ear while I was gardening. I tried to get it out by using a Q-tip. I just don’t know what else to do, but this buzzing noise is driving me crazy,” a 55-year-old female tells you. No previous medical history and vital signs are within normal limits.

Agent_1 (to chat_manager):
Based on the information provided and the Emergency Severity Index handbook, this patient’s clinic note would be categorized as Low ESI level. The patient’s condition is not life-threatening but requires medical intervention, which aligns with a lower ESI level.
References: - ESI Handbook v4, Chapter 2: ESI Triage Algorithm, p.10-13.

Agent_2 (to chat_manager):
The scenario describes a 55-year-old female who has a foreign object (a bug) in her ear, causing discomfort and a buzzing noise. She has no previous medical history and her vital signs are within normal limits. This is a non-life-threatening condition, but it does require medical intervention to remove the foreign object and alleviate the patient’s discomfort.
The Emergency Severity Index (ESI) is a five-level emergency department triage algorithm that categorizes patients by both acuity and resource needs. ESI level 1 is for patients who need immediate life-saving intervention, while ESI level 5 is for patients with non-urgent conditions.
Given the patient’s condition, she would likely be classified as ESI level 4. This level is for patients with conditions that could potentially escalate to a more serious problem requiring emergency intervention if left untreated, but are not life-threatening. The patient’s condition is not life-threatening, but it is causing her significant discomfort and could potentially lead to complications such as infection if not addressed.
The most alarming aspect of this issue is the potential for complications if the foreign object is not removed promptly. The use of a Q-tip could potentially push the bug further into the ear canal, causing more damage. It’s also noteworthy that the patient’s vital signs are within normal limits, indicating that she is not in immediate danger.
After a second revision, I still believe that the ESI level for this patient would be ESI level 4. The patient’s condition is not life-threatening, but it does require medical intervention.
Confidence score: 90%.

Agent_3 (to chat_manager):
After reviewing the patient’s condition and the Emergency Severity Index handbook, I would classify this patient as ESI level 4 with a confidence score of 90%. The patient’s condition is not life-threatening but requires medical intervention to remove the foreign object from her ear and alleviate her discomfort. This aligns with the criteria for ESI level 4, which is for conditions that could potentially escalate to a more serious problem if left untreated but are not life-threatening.

Agent_2 (to chat_manager):
Based on the patient’s clinical record and the Emergency Severity Index handbook, I would classify this patient as ESI level 4 with a confidence score of 90%. The patient’s condition is not life-threatening but requires medical intervention to remove the foreign object from her ear and alleviate her discomfort. This aligns with the criteria for ESI level 4, which is for conditions that could potentially escalate to a more serious problem if left untreated but are not life-threatening.

Result_summarizer (to chat_manager):
After synthesizing the inputs from the Agent_1, Agent_2, and Agent_3, the consensus is that the patient should be classified as ESI level 4. This level is appropriate for patients with conditions that could potentially escalate to a more serious problem if left untreated but are not life-threatening. The patient’s condition of having a bug in her ear is causing significant discomfort and requires medical intervention, but it is not life-threatening. Therefore, the final decision is ESI level 4.

• Type-3: Confidence-based Consistency Errors

This error might be caused by the early-stopping mechanism and confidence-score passing process.

Admin (to chat_manager):
What is the ESI level of the following clinic record? Please give me a final unique answer after a second revision from your first proposed answer. The Candidate answers: ESI-1,2,3,4,5. Double Check the clinic notes and related medical knowledge and ask yourself again (two-round self-check) when you are sure about this ESL level classification before you give me the answer. Here is the record: “I slipped on the ice, and I hurt my wrist,” reports a 58-year-old female with a history of migraines. There is no obvious deformity. Vital signs are within normal limits, and she rates her pain as 5/10.

Response 1 by chat_manager (initial response):
Based on the patient’s description and clinical information provided, I recommend classifying this patient as ESI level 4. The injury mechanism and the pain level, along with normal vital signs, suggest a non-life-threatening condition.
Confidence score: 90%.

Response 2 by chat_manager (second response after self-check):
Upon further reflection and review of the medical knowledge, I now assess the patient’s condition as ESI level 3. Considering the possibility of a hidden injury due to the nature of the fall, a more cautious approach is warranted.
Confidence score: 92%.

• Type-4: Exchange of Information Errors

User_proxy (to chat_manager):
What is the ESI level of the following clinical record according to the Emergency Severity Index handbook? Please provide a final answer after a second revision from your initial proposed answer. The candidate answers are: ESI-1, 2, 3, 4, 5. Double-check the clinic notes and related medical knowledge before finalizing your decision. Here is the record: “A bug flew into my right ear while I was gardening. I tried to get it out by using a Q-tip. I just don’t know what else to do, but this buzzing noise is driving me crazy,” a 55-year-old female tells you. No previous medical history and vital signs are within normal limits.

Agent_1 (to chat_manager):
Based on the information provided and the Emergency Severity Index handbook, this patient’s clinic note would be categorized as a low ESI level. The patient’s condition is not life-threatening but requires medical intervention, aligning with a lower ESI level, (e.g., ESI level 5), with a confidence score of 80%.

Agent_2 (to chat_manager):
The scenario describes a 55-year-old female
with a foreign object (a bug) in her ear, causing
discomfort and a buzzing noise. She has no
previous medical history, and her vital signs are
within normal limits. This is a non-life-threatening
condition but requires medical intervention to
remove the foreign object and alleviate the
patient’s discomfort.
Given the patient’s condition, she would likely be
classified as ESI level 4. This level is for patients
with conditions that could potentially escalate to
a more serious problem if left untreated but are
not life-threatening. After a second revision, I still
believe that the ESI level for this patient is ESI
level 4, with a confidence score of 90%.

Agent_3 (to chat_manager):
After reviewing the patient’s condition and the
Emergency Severity Index handbook, I would
classify this patient as ESI level 4 with a confi-
dence score of 90%. The patient’s condition is not
life-threatening but requires medical intervention
to remove the foreign object from her ear and
alleviate her discomfort. This aligns with the
criteria for ESI level 4.

Result_summarizer (to chat_manager):
After synthesizing the inputs from the Agent_1,
Agent_2, and Agent_3, the consensus is that
the patient should be classified as ESI level
4. This level is appropriate for patients with
conditions that could potentially escalate to a
more serious problem if left untreated but are not
life-threatening. The patient’s condition of having
a bug in her ear is causing significant discomfort
and requires medical intervention, but it is not
life-threatening. Therefore, the final decision is
ESI level 4, with a consensus confidence score of
90%.

Explanation: In this example, Agent_1 gives only a vague classification (“low ESI level”, with ESI level 5 as an example), which is inconsistent with the other agents (ESI level 4). Such inconsistency could lead to confusion and incorrect final decisions, representing an Exchange of Information Error.
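
One way such a mismatch could be surfaced automatically is sketched below. This is an illustrative heuristic only and is not part of the framework described in the paper: the function names, the regular expression, and the majority rule are all assumptions. It takes the last explicit "ESI level N" mention in each agent message as that agent's answer and flags agents that deviate from the majority.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical pattern: matches "ESI level 4", "ESI-4", or "ESI 4".
ESI_PATTERN = re.compile(r"ESI[\s-]*(?:level\s*)?([1-5])\b", re.IGNORECASE)


def extract_esi_level(message: str) -> Optional[int]:
    """Return the last explicit ESI level in a message, or None if absent."""
    matches = ESI_PATTERN.findall(message)
    return int(matches[-1]) if matches else None


def flag_outliers(messages: Dict[str, str]) -> Tuple[Dict[str, Optional[int]], List[str]]:
    """Map each agent to its extracted level and list agents off the majority."""
    levels = {agent: extract_esi_level(text) for agent, text in messages.items()}
    counts = Counter(level for level in levels.values() if level is not None)
    if not counts:
        # No agent stated an explicit level, so nothing can be compared.
        return levels, []
    majority_level, _ = counts.most_common(1)[0]
    # Agents with a different level, or with no explicit level, are flagged.
    outliers = [agent for agent, level in levels.items() if level != majority_level]
    return levels, outliers
```

Applied to the Type-4 conversation, this heuristic would read Agent_1 as ESI level 5 and the other two agents as ESI level 4, so Agent_1 would be flagged before the summarizer reports a consensus. Taking the last mention is a simplification and can misread messages that end by describing other levels.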
