
Can Large Language Models Replace Coding Specialists? Evaluating GPT Performance in Medical Coding Tasks
Yeli Feng

Amplify Health Asia

Research Article

Keywords: many-shot learning, in-context learning, meta prompt, GPT-4o, medical coding, clinical trials,
diagnostic-related groups

Posted Date: January 8th, 2025

DOI: https://doi.org/10.21203/rs.3.rs-5750190/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: No competing interests reported.


Can Large Language Models Replace Coding
Specialists? Evaluating GPT Performance in
Medical Coding Tasks
Yeli Feng1*
1* Amplify Health Asia, 21 Collyer Quay, Level 8, Singapore, 049320,
Singapore.

Corresponding author(s). E-mail(s): [email protected];

Abstract
Purpose: Large Language Models (LLMs), GPT in particular, have demonstrated near human-level performance in the medical domain, from summarizing clinical notes and passing medical licensing examinations to predictive tasks such as disease diagnosis and treatment recommendation. However, there is currently little research on their efficacy for medical coding, a pivotal component of health informatics, clinical trials, and reimbursement management. This study proposes a prompt framework and investigates its effectiveness in medical coding tasks.
Methods: First, a medical coding prompt framework is proposed. This framework aims to improve performance on complex coding tasks by leveraging state-of-the-art (SOTA) prompt techniques, including meta prompting, many-shot learning, and dynamic in-context learning, to extract task-specific knowledge. The framework is implemented with a combination of the commercial GPT-4o and open-source LLMs. Its effectiveness is then evaluated on three different coding tasks. Finally, ablation studies are presented to validate and analyze the contribution of each module in the proposed prompt framework.
Results: On the MIMIC-IV dataset, the prediction accuracy is 68.1% over the 30 most frequent MS-DRG codes. To the best of our knowledge, this is comparable to the SOTA of 69.4%, achieved by fine-tuning the open-source LLaMA model, and the top-5 accuracy is 90.0%. The clinical trial criteria coding task yields a macro F1 score of 68.4 on the Chinese CHIP-CTC test dataset, close to the 70.9 of the best supervised model training method in comparison. For the less complex semantic coding task, our method yields a macro F1 score of 79.7 on the Chinese CHIP-STS test dataset, which is not competitive with most supervised model training methods in comparison.

Conclusion: This study demonstrates that for complex medical coding tasks, carefully designed prompt-based learning can achieve performance similar to SOTA supervised model training approaches. Currently, it can serve as a very helpful assistant, but it does not replace human coding specialists. With the rapid advancement of LLMs, their potential to reliably automate complex medical coding in the near future should not be underestimated.

Keywords: many-shot learning, in-context learning, meta prompt, GPT-4o, medical coding, clinical trials, diagnostic-related groups

1 Introduction
Since OpenAI released ChatGPT two years ago [1], the world's leading artificial intelligence (AI) research powerhouses have been relentlessly pushing the SOTA in large language models (LLMs). For example, Google introduced Gemini [2] and Meta released the open-source LLaMA [3], among many others. Riding this wave, applied AI research based on commercial, proprietary, or open-source LLMs has flourished in many application domains, from biology, medicine, education, and software engineering to content creation and customer service.
In the medical and healthcare domain, the research literature has shown ChatGPT to be very effective in comprehending English as well as non-English medical documents. GPT-4 achieved scores above the passing level in the Korean Pharmacist Licensing Examination [4]. Subsequently, the enhanced version GPT-4o significantly outperformed average medical students in the United States Medical Licensing Examination [5]. Luo et al. proposed BrainGPT [6] and showed that an enhanced LLM exceeded human neuroscience experts in behavioral prediction tasks that require neuroscience knowledge. In a systematic review of ChatGPT in health care applications [7], Wang et al. concluded that conversational LLMs perform well in summarizing health-related texts and answering general medical knowledge questions.

1.1 The Role of AI in Medical Coding


The potential of LLMs to improve the accuracy or explainability of automated medical coding is also being investigated. Medical coding is the process of translating the free-text descriptions of disease diagnoses, procedures, and so on in electronic health records (EHRs) into alphanumeric codes. These codes are crucial for effective health informatics management [8], financial reimbursement [9], and clinical trials [10], as well as for retrospective analysis of trends in patient populations, allocation of resources, and clinical research. Manual coding is labor-intensive, time-consuming, and error-prone, which has led to a great deal of research exploring the potential of natural language processing and deep learning techniques to automate this process [11].
Using clinical text notes in the MIMIC-III dataset [12] to predict international
classification of diseases (ICD) codes has been one of the most studied coding tasks. In
[13], Hu et al. proposed SWAM, an ICD code prediction model based on a convolutional
neural network (CNN). JLAN is a bidirectional long short-term memory (Bi-LSTM) model proposed by Li et al. [14]. Using the Transformer-based pre-trained language model BioBERT [15] as the backbone, Huang et al. proposed PLM-ICD to tackle the challenges of the task at hand, including long input text and a large prediction label space [16]. In terms of macro F1 score, prediction performance on the 50 most frequent ICD codes in the MIMIC-III dataset reached 0.603 with SWAM and 0.615 with PLM-ICD, and was further improved to 0.665 by JLAN.
Another frequently studied task is to predict Diagnostic Related Groups (DRG),
a coding system used to classify hospital cases into groups based on the diagnosis,
treatment, and risk of mortality of the patient. Liu et al. trained a long LSTM model
over discharge summaries in the MIMIC-III dataset, which resulted in a macro F1 score of 0.041 in predicting all DRG codes [17]. The DRGCoder proposed in [18] leveraged the domain-specific pre-trained Transformer ClinicalBERT [19] for the same task and reported a macro F1 score of 0.101 over all DRG codes.
The vanilla BERT model and many of its variants have also been adopted as a
backbone network for classifying clinical trial eligibility criteria, where deep expertise
is often required for coding. In [20], RoBERTa-large [21] achieved a macro F1 score
of 0.709 to predict an institutional coding schema of 44 categories, where real clinical
trial registration data collected from the Chinese Clinical Trial Registry were used.
In [22], Feng et al. exploited the sentence-T5 [23] as a semantic feature generator to
encode clinical text.

1.2 Development in the LLM Era


Since the release of GPT, its potential for medical coding has also been explored. In [24], the free 2023 version of ChatGPT was used to assign ICD-10 codes to more than 150 clinical cases selected from technical books. The authors concluded that ChatGPT can assist but does not replace medical coders. Using leading LLMs including GPT, Claude, Gemini, and Llama, Simmons et al. [25] conducted a large-scale investigation with patient notes from the American Health Information Management Association. GPT-4 achieved the highest agreement with human coders, at 15.2%. The authors concluded that the LLMs under evaluation performed poorly in extracting ICD-10 codes from hospital notes. Using Azure OpenAI GPT-3.5, Falis et al. [26] conducted a similar investigation and reported a macro F1 score of 14.76 on the MIMIC-IV [27] test dataset. The authors reached a similar conclusion: GPT-3.5 with basic prompting alone is insufficient for ICD-10 coding.
Instead of relying on basic prompting alone, Wang et al. fine-tuned the LLaMA model [3] with hospital discharge summaries from the MIMIC-IV dataset for DRG prediction. The authors reported that their DRG-LLaMA [28] achieved a macro F1 score of 0.327 on all DRGs, surpassing previous leading models in DRG prediction, including ClinicalBERT. In addition to fine-tuning LLMs to improve performance on specific tasks, prompt-based learning is another paradigm from which many new techniques are emerging, from few-shot learning, in-context learning, and prompt ensembles to prompt template learning, and many more [29].

1.3 Techniques Improving LLM Performance in General
Domain
As the context window size of LLMs increases rapidly, for example from 2,048 tokens in GPT-3.5 to 8,192 tokens in GPT-4, complex tasks can benefit from scaling up the number of in-context examples in a prompt. With Gemini 1.5 Pro supporting a context of up to 1 million tokens, Agarwal et al. investigated the impact of the number of examples on 11 different types of text generation and prediction tasks [30]. The authors introduced a many-shot in-context regime and reported that large performance jumps occurred when prompting with several hundred to thousands of shots, especially on complex reasoning tasks.
In prompt-based learning, the design of prompts can greatly affect task per-
formance. Many prompt engineering methods have been proposed to overcome the
suboptimal nature of hand-crafted task prompts. In [31], Gao et al. proposed a pipeline
that automates prompt template generation and optimization. The authors reported
that their methods outperformed the manual prompt by 11% on average in text
classification tasks. Promptbreeder [32] introduced an optimization workflow that iter-
atively mutates task prompts and evaluates their fitness on a training dataset. Hou
et al. argued that the starting point of searching for optimal task prompts matters.
The authors proposed metaprompting [33], a task-agnostic framework to learn general
meta-knowledge from specific task domains for a better initial task prompt.

1.4 Aim of the Study


In most medical coding prediction works, GPT-3.5 or GPT-4 was used for performance evaluation. In contrast, GPT-4o [34], released in mid-2024, has a context window of 128,000 tokens, roughly 16 times larger than that of GPT-4 (8,192 tokens) and over 62 times larger than that of GPT-3.5. This drastic increase in token capacity supports the design of much larger and more sophisticated prompts to improve performance on complex tasks. This study proposes a prompt framework for medical coding and uses GPT-4o to evaluate its performance over English and Chinese text.

2 Methods
Every medical code schema, be it an international or institutional standard, is a taxonomy system that comprises definitions of each category in the system and guidelines on which category to assign to a given text note. In the paradigm of training a model for a specific task [14]-[22], the quality and amount of annotated text notes used to train the prediction model greatly impact task performance, and task-specific knowledge such as the taxonomy is not exploited. Inspired by the latest developments in prompt techniques [32]-[34], this paper proposes a prompt framework specifically designed for medical coding tasks. The rest of this section first presents the proposal and then discusses its application to three coding tasks and the corresponding prompt design.

Fig. 1 Framework of Medical Coding Task Prompt

2.1 Prompt Framework


As shown in Figure 1, in addition to an input text and a task instruction, a medical coding prompt consists of system knowledge and many-shot learning examples. The system knowledge has two components: a task context that provides an overview of the task and related medical terminology, and guidelines that aim to improve text classification accuracy. For each task, the guidelines are learned once using a meta prompt that extracts common patterns from the text-and-code pairs of the corresponding training dataset. Because a training dataset is too large to fit in one context window, the guidelines are learned batch by batch and then summarized into a single set. The guideline-learning components are shown as green boxes in Figure 1. GPT-4o understands more than 50 different languages; for non-English input text, the meta prompt first translates it into English and then learns guidelines from the translated text.
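To make this guideline-learning step concrete, a minimal sketch in Python is given below. The prompt wording, batch size, helper names, and the OpenAI-style client call are illustrative assumptions rather than the exact prompts and code used in this study.

```python
# Illustrative sketch of batch-wise guideline learning with a meta prompt.
# The prompt text, batch size, and client setup are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint (e.g., Azure OpenAI) is configured

META_PROMPT = (
    "You are a medical coding expert. Below are pairs of clinical text and their "
    "assigned codes. If a text is not in English, translate it to English first. "
    "Extract concise classification guidelines describing the common patterns "
    "that explain how the codes were assigned.\n\n{examples}"
)

def learn_guidelines(train_pairs, batch_size=50, model="gpt-4o"):
    """Learn guidelines batch by batch, then summarize them into a single set."""
    partial_guidelines = []
    for i in range(0, len(train_pairs), batch_size):
        batch = train_pairs[i:i + batch_size]
        examples = "\n".join(f"TEXT: {text}\nCODE: {code}" for text, code in batch)
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": META_PROMPT.format(examples=examples)}],
        )
        partial_guidelines.append(response.choices[0].message.content)
    # Summarize the per-batch guidelines into one consolidated, deduplicated set.
    summary = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "Merge the following guideline lists into a single "
                              "deduplicated set of classification guidelines:\n\n"
                              + "\n\n".join(partial_guidelines)}],
    )
    return summary.choices[0].message.content
```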
In [30], Agarwal et al. showed experimentally that prompting a task with several hundred to thousands of input-output examples for the LLM to learn from can significantly improve performance. The proposed prompt framework adopts this many-shot regime, but instead of randomly drawing hundreds of examples from a training dataset, a dynamic in-context retriever is designed. For each task, the texts in the training dataset are vectorized once using a small sentence-embedding LLM. At test time, an input text is vectorized, and then a batch of the semantically most similar training examples and their answers is retrieved to form the many-shot in-context learning examples, as shown in the task prompt box in Figure 1.
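A minimal sketch of how these components could be assembled into a single task prompt is shown below; the field layout and wording are illustrative assumptions, not the exact template behind Figure 1.

```python
# Minimal sketch of assembling a task prompt from the framework components in
# Figure 1: system knowledge (task context + guidelines), many-shot examples
# retrieved for the current input, the task instruction, and the input text.
def build_task_prompt(task_context, guidelines, shots, task_instruction, input_text):
    example_block = "\n\n".join(f"TEXT: {text}\nCODE: {code}" for text, code in shots)
    return (
        f"{task_context}\n\n"
        f"Classification guidelines:\n{guidelines}\n\n"
        f"Examples:\n{example_block}\n\n"
        f"{task_instruction}\n"
        f"TEXT: {input_text}\nCODE:"
    )
```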
We hypothesize that, rather than only increasing the absolute number of learning examples, the more relevant the learning examples are to a test input, the more helpful they are to the LLM in performing the task. This hypothesis is validated experimentally in Sections 3 and 4. The next subsections discuss the application of the proposed prompt framework to the three medical coding tasks.

2.2 Meta Prompt Design
2.2.1 Task 1: Semantic Coding
This task is to categorize a pair of disease-related questions as similar or different, where similar means that both questions ask about the same disease-related issue. The question pairs are in Chinese; see Figure 2 for examples of original questions and their English translations. CHIP-STS is a dataset from the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark [35].
Applying a meta prompt that includes an instruction to translate all original questions from Chinese to English, a total of 20 classification guidelines are learned from the 16,000 training samples in the CHIP-STS dataset. See a few guideline examples in Figure 2.

Fig. 2 Task 1: Examples of CHIP-STS Data and Task Guidelines Learned

2.2.2 Task 2: Clinical Trial Criteria Coding


This task consists of categorizing a short Chinese sentence into 1 of 44 categories of
clinical trial criteria. The dataset used here is CHIP-CTC from the CBLUE bench-
mark, where all text data is from the Chinese Clinical Trial Registry. One challenge
of the dataset is that the number of samples in each category is drastically different.
As shown in Figure 3, the categories Disease and Therapy or Surgery constitute more than 40% of the total samples, while about half of the 44 categories each account for less than 1% of the total.
For this task, there are a total of 22,962 training samples in the CHIP-CTC dataset.
For each category, the corresponding training examples are selected from the training set for a meta prompt to learn a classification guideline, yielding a total of 44 guidelines. See a couple of examples in Figure 4.

Fig. 3 CHIP-CTC Dataset Sample Frequencies by Code Categories

Fig. 4 Examples of Guidelines Learned by Meta Prompt

2.2.3 Task 3: MS-DRG Coding


This task consists of predicting MS-DRG codes from hospital discharge summaries.
There are more than 700 unique MS-DRG codes [36]. Each DRG code describes a set
of patient attributes, including the primary diagnosis, specific secondary diagnoses,
procedures, sex, and discharge status. The discharge status tells whether the patient was discharged alive or not, with or without major complications and comorbid conditions. We follow the dataset preparation schema in DRG-LLaMA [28] to randomly divide the MIMIC-IV dataset into 90% for training and 10% for evaluation.
In this study, we evaluate the 30 most frequent DRGs. DRG is an international standard in which each DRG code has a definition serving as a manual for code mapping. These definitions are short phrases given in [36]; see a couple of examples in Figure 5. Instead of learning classification guidelines from the task training set, the meta prompt for this task translates each DRG definition into a more comprehensible, cohesive short paragraph, as shown in Figure 5.

Fig. 5 Examples of MS-DRG Definition and Corresponding Guidelines Learned

2.3 Many-shot Dynamic In-Context Retriever


In natural language processing (NLP), vector embedding is a technique that transforms words, phrases, or texts into numerical vectors so that quantitative metrics can be used for analysis. Since sentence-BERT [37] set a new SOTA on a range of NLP tasks, many small LLMs have been made available on Hugging Face. In this study, we curated two small LLMs for the text-embedding functionality in the retriever module.
The text data in tasks 1 and 2 are Chinese, so CoSENT [38], trained with Chinese corpora, is used to generate embedding vectors for the Chinese sentences and question pairs from the respective datasets. At test time, each input in its original language is first transformed into an embedding vector, and then cosine similarity is used to find the N most similar examples from the corresponding training set. The discharge summaries in task 3 are in English and are on average much longer, so the sentence-T5-large [23] model is chosen to embed them.
N refers to the number of many-shot learning examples, as shown in the task prompt box in Figure 1. The optimal value of N for each task and other aspects of the evaluation setup are discussed in Section 3.
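A minimal sketch of such a retriever is shown below, written against the sentence-transformers library; the class structure, model name, and default N are illustrative assumptions (this study uses CoSENT for the Chinese tasks and sentence-T5-large for the English task).

```python
# Sketch of the many-shot dynamic in-context retriever: embed all training
# texts once, then retrieve the N most similar (text, code) examples for each
# test input by cosine similarity over normalized embedding vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

class DynamicRetriever:
    def __init__(self, train_texts, train_codes,
                 model_name="sentence-transformers/sentence-t5-large"):
        self.model = SentenceTransformer(model_name)
        self.train_texts = train_texts
        self.train_codes = train_codes
        # Embed the training set once; unit-normalize so dot product = cosine similarity.
        self.train_vecs = self.model.encode(train_texts, normalize_embeddings=True)

    def retrieve(self, input_text, n=40):
        query_vec = self.model.encode([input_text], normalize_embeddings=True)[0]
        similarities = self.train_vecs @ query_vec
        top_indices = np.argsort(-similarities)[:n]
        return [(self.train_texts[i], self.train_codes[i]) for i in top_indices]
```

At test time, the retrieved pairs are placed in the task prompt as the many-shot learning examples.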

3 Results
3.1 Technical Setup
To evaluate the performance of the prompt framework applied to these coding tasks, the GPT-4o 2024-08-01 preview version of the Azure OpenAI service was used. The Azure OpenAI service is one of the three online GPT-like services recommended for responsible use of the MIMIC dataset [39].
Following related work, the macro F1 and accuracy metrics are used for performance assessment in this study. A macro F1 score is the average of the F1 scores of each unique code and gives a sense of effectiveness on minority classes in tasks 2 and 3, where the data are extremely imbalanced. In contrast, a micro F1 score averages over an entire test dataset and therefore reports performance that is potentially skewed toward majority classes. In single-label multiclass classification tasks, micro F1 is equivalent to accuracy.
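As a small illustration of the difference between the two metrics (the labels below are invented for illustration, not drawn from the study data):

```python
# Macro F1 averages the per-class F1 scores equally, so a missed minority class
# pulls it down; micro F1 aggregates over all samples and, in single-label
# multiclass prediction, equals accuracy.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["DRG871", "DRG871", "DRG871", "DRG291", "DRG470"]
y_pred = ["DRG871", "DRG871", "DRG871", "DRG470", "DRG470"]

print(f1_score(y_true, y_pred, average="macro"))   # ~0.56: the missed minority class hurts
print(f1_score(y_true, y_pred, average="micro"))   # 0.8
print(accuracy_score(y_true, y_pred))              # 0.8, identical to micro F1
```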

3.2 Performance of Tasks 1 and 2


The CHIP datasets have training, development, and testing splits, where true class labels are available only in the training and development sets. The training sets are used to learn classification guidelines for the respective tasks. GPT-4o has a context window of 128,000 tokens, so it can support hundreds, if not a thousand, learning examples in the task prompt.
To find the optimal many-shot size for each task, part of the development sets (sample IDs 1-1000 in task 1 and 1-500 in task 2) was used for grid searching. Considering the high latency and cost of the Azure OpenAI service, we use such subsets to determine the optimal hyperparameters in all tasks.

Fig. 6 Comparison of Prediction Performance by Many-shot Size

As shown in Figure 6, performance improves significantly in both tasks as the number of shots increases. In task 1, the best macro F1 score of 0.7909 is reached with 40 shots, and the best result in task 2 is 0.8274 with 20 shots. When the many-shot size continues to grow, task 2 performance degrades quickly and task 1 performance also drops gradually.

Table 1 Performance (Macro-F1 Score) on the CBLUE Benchmark

Task               MacBERT-large  RoBERTa-large  ZEN    ALBERT-xxlarge  Ours
Task 1 (STS test)  85.6           84.7           83.5   84.8            79.7
Task 2 (CTC test)  68.6           70.9           68.6   66.9            68.4

The STS and CTC test sets have 10,000 and 10,193 samples, respectively. The performance of tasks 1 and 2 is evaluated on the CBLUE benchmark platform with the complete test sets. In Table 1, the results of our method are compared with the results of the leading methods reported in [35], where BERT networks with various improvements were trained on Chinese corpora. Task 1 predicts whether a given question pair is semantically similar or not; existing supervised model training approaches outperform our prompt-based learning method. Task 2 is more complex, classifying a given sentence into 1 of 44 categories; here our prompt-based learning method performs on par with existing supervised model training approaches and is even better than ALBERT-xxlarge.

3.3 Performance of Task 3


The 30 most frequent DRGs in the MIMIC-IV test set have 7,665 samples, which form the test set of Task 3. The sample frequencies are shown in the top plot of Figure 7. A discharge summary is much longer than the input data in the previous two tasks, so the maximum many-shot size used for evaluation is 80, the maximum number supported by GPT-4o.
In Table 2, the result of our method over the complete test set is compared with
related works on MIMIC datasets, where IV is an updated version of III. Both CAML
and DRG-LLaMA are supervised model training methods. Their difference is that in
the CAML method, CNN features are pooled using the attention mechanism for each
class. In the latter method, an LLaMA model is fine-tuned for the task.
We can see that the SOTA LLM exhibits a clear superiority over the CNN. DRG-LLaMA outperforms CAML by 0.192 in accuracy. When the performance on minority classes (identified by their low frequencies in the upper part of Figure 7) is considered, DRG-LLaMA outperforms CAML by an even larger margin of 0.342 in macro F1. Our prompt-based learning method also outperforms CAML at a similar level and is on par with DRG-LLaMA when measured in terms of accuracy. Top-5 accuracy means that for each test input, the five most likely DRG codes are predicted, and the prediction counts as correct if the true code is among them.
Measured in macro F1, DRG-LLaMA surpasses our method. In Figure 7, the performance of each DRG code is plotted in the order of its sample frequency. We can see that there is a weak positive correlation between code prediction performance and the corresponding sample frequency. The bottom-performing code predictions cluster in the below-1% frequency region.

Fig. 7 MIMIC-IV Sample Frequencies by MS-DRGs and Prediction Performance

Table 2 Prediction Performance of Most Frequent 30 MS-DRGs

Method             Dataset    Macro-F1  Accuracy  Top-5 Accuracy
CAML [17]          MIMIC-III  0.395     0.502¹    -
DRG-LLaMA-7B [28]  MIMIC-IV   0.737     0.694     0.941
Ours               MIMIC-IV   0.621     0.681     0.900

¹ In CAML, the result was reported as a micro-F1 score, which is equivalent to accuracy.

4 Discussion
The section above gives the overall performance of the proposed prompt framework applied to three different medical coding tasks, from relatively simple semantic coding to the more complex clinical trial criteria and MS-DRG coding tasks. We also investigated whether GPT-4o mini of the same version performs on par or better, and found that GPT-4o gives better results: for example, on the CTC dev set (500 samples), GPT-4o mini scores 0.033 lower than GPT-4o in macro F1.
This section first presents ablation studies that validate the effectiveness of each module in the proposed prompt framework, including the hypothesis that the more relevant the learning examples are to a test sample, the more helpful they are to the LLM in performing the task. Potential limitations are then discussed before concluding the study.

4.1 Efficacy of Task Guidelines
To find out whether the classification guidelines learned by meta prompts are helpful
in improving prediction performance, we remove the guidelines from task prompts and
run the experiments again, but with smaller evaluation sets, namely ablation study
sets.
Task 3 benefits greatly from the task guidelines. Without the meta-prompt-learned classification guidelines, a drop of as much as 0.343 in macro F1 score is observed over an ablation set of 100 samples from the task test set. The guidelines also help the performance of tasks 2 and 1: without them, the macro F1 score drops by 0.056 and 0.030, respectively, over the ablation sets CTC dev set (500) and STS dev set (1000) as labeled in Figure 6. The results indicate that the more complex the task, the more helpful the guidelines.

4.2 Efficacy of Dynamic In-Context Learning


Dynamic means that the many-shot learning examples in the task prompt are the ones semantically most similar to a test input. To measure its effect, in this ablation study the learning examples are instead drawn randomly from a task's training dataset, that is, without any correlation with the test input, as sketched below.
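A minimal sketch of this baseline, assuming the same (text, code) pair format as the retriever sketch in Section 2.3:

```python
# Ablation baseline: draw the many-shot examples uniformly at random from the
# training set, so they have no semantic correlation with the test input.
import random

def random_shots(train_pairs, n=40, seed=0):
    rng = random.Random(seed)
    return rng.sample(train_pairs, n)  # drop-in replacement for DynamicRetriever.retrieve()
```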
Experiments with the same ablation sets reveal that the dynamic in-context method is extremely helpful for the complex tasks 2 and 3. Without being able to learn from examples semantically similar to the test inputs, macro F1 scores over the respective ablation sets drop by as much as 0.152 in task 2 and 0.186 in task 3.
However, both approaches give the same performance in task 1. One possible reason is that task 1 is relatively simple: it predicts one of only two classes, and its inputs are on average much shorter than those of task 3.

4.3 Limitation
In tasks 1 and 2, once the many-shot size reaches a certain point, as shown in Figure 6, task performance is penalized as the size continues to increase. In task 3, because the input data are much larger, only up to 80 shots fit in the GPT-4o task prompt. An ablation study with a many-shot size of 30 results in a 0.035 decrease in macro F1 score over its ablation set, which suggests that DRG code prediction performance could benefit from an LLM that supports a larger context window. For projects that consume tokens at large scale, another limitation of using online LLM services can be their cost and high latency.

4.4 Conclusions
This study proposes a language-agnostic prompt framework for predicting medical codes with LLMs. The framework exploits the latest techniques from the prompt-based learning field, including meta prompting, many-shot learning, and dynamic in-context learning, to improve performance on complex tasks. The framework implementation combines the commercial Azure OpenAI GPT-4o service with small open-source LLMs. Its effectiveness is evaluated on different tasks in the context of institutional and standard coding schemas. Ablation studies show that the key proposals, extracting task-specific knowledge into classification guidelines and many-shot dynamic in-context learning, are effective and drastically lift the performance of complex tasks. Compared to related works that take a supervised model training approach, our prompt-based learning framework gives comparable performance on the two complex coding tasks but underperforms on the relatively simple semantic coding task. With the rapid advancement of LLMs, their context windows will continue to grow, and the proposed prompt framework has the potential to further enhance DRG code prediction performance.

Acknowledgements
This research is supported by Amplify Health Asia.

Funding Declaration
There was no funding.

Declarations
The author has no relevant financial or non-financial interests to disclose.

References
[1] OpenAI: ChatGPT-Release Notes. 30 Nov 2022. https://help.openai.com/en/
articles/6825453-chatgpt-release-notes. Accessed 15 Dec 2024

[2] Google: Introducing Gemini: our largest and most capable AI model. 06 Dec 2023.
https://blog.google/technology/ai/google-gemini-ai/. Accessed 15 Dec 2024

[3] Meta AI: Introducing LLaMA: A foundational 65-billion-parameter large language model. 16 Mar 2023. https://ai.facebook.com/blog/large-language-model-llama-meta-ai/. Accessed 15 Dec 2024

[4] Kyung, J.H., Kim, E.: Performance of gpt-3.5 and gpt-4 on the korean pharmacist
licensing examination: Comparison study. JMIR Medical Education 10:e57451
(2024) https://doi.org/10.2196/57451

[5] Bicknell BT, B.D., et al.: Chatgpt-4 omni performance in usmle disciplines and
clinical skills: Comparative analysis. JMIR Medical Education 10:e63430 (2024)
https://doi.org/10.2196/63430

[6] Xiaoliang Luo, A.R., et al.: Large language models surpass human experts in predicting neuroscience results. Nat Hum Behav (2024) https://doi.org/10.1038/s41562-024-02046-9

[7] Leyao Wang, Z.W., et al.: Conversational large language models in health care:
Systematic review. J Med Internet Res 26:e22769 (2024) https://doi.org/10.
2196/22769

[8] Shepheard, J.: Clinical coding and the quality and integrity of health data. Health
Inf Manag 49:3-4 (2020) https://doi.org/10.1177/1833358319874008

[9] Drabiak, K., Wolfson, J.: What should health care organizations do to reduce
billing fraud and abuse? AMA J Ethics 22(3):221-231 (2020) https://doi.org/
10.1001/amajethics.2020.221

[10] Babre, D.: Medical coding in clinical trials. Perspect Clin Res 1(1):29-32 (2010)

[11] Shaoxiong Ji, X.L., et al.: A unified review of deep learning for automated medical
coding. ACM Computing Surveys 55(12):1-41 (2024) https://doi.org/10.1145/
3664615

[12] Alistair EW Johnson, T.J.P., et al.: Mimic-iii, a freely accessible critical care
database. Sci Data 3:160035 (2016) https://doi.org/10.1038/sdata.2016.35

[13] Shuyuan Hu, F.T., et al.: An explainable cnn approach for medical codes pre-
diction from clinical text. BMC Med Inform Decis Mak 21:256 (2021) https:
//doi.org/10.1186/s12911-021-01615-6

[14] Xingwang Li, Y.Z., et al.: JLAN: medical code prediction via joint learning atten-
tion networks and denoising mechanism. BMC Bioinformatics 21(1):590 (2021)
https://doi.org/10.1186/s12859-021-04520-x

[15] Jinhyuk Lee, W.Y., et al.: Biobert: a pre-trained biomedical language representa-
tion model for biomedical text mining. Bioinformatics 36(4):1234-1240 (2020)
https://doi.org/10.1093/bioinformatics/btz682

[16] Chao-Wei Huang, S.-C.T., et al.: PLM-ICD: Automatic icd coding with pre-
trained language models. In: Proceedings of the 4th Clinical Natural Language
Processing Workshop, pp. 10–20. Association for Computational Linguistics,
Seattle, WA (2022). https://doi.org/10.18653/v1/2022.clinicalnlp-1.2

[17] Jinghui Liu, D.C., et al.: Early prediction of diagnostic-related groups and esti-
mation of hospital cost by processing clinical notes. npj Digit. Med 4:103 (2021)
https://doi.org/10.1038/s41746-021-00474-9

[18] Daniel Hajialigol, D.K., et al.: DRGCODER: Explainable clinical coding for the
early prediction of diagnostic-related groups. In: Proceedings of the 2023 Confer-
ence on Empirical Methods in Natural Language Processing: System Demonstra-
tions, pp. 373–380. Association for Computational Linguistics, Singapore (2023).
https://doi.org/10.18653/v1/2023.emnlp-demo.34

[19] Emily Alsentzer, J.M., et al.: Publicly available clinical bert embeddings. In: Pro-
ceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78.
Association for Computational Linguistics, Minneapolis, Minnesota, USA (2019).
https://doi.org/10.18653/v1/W19-1909

[20] Brett R South, V.C.W., et al.: Real-World Use of an Artificial Intelligence-Supported Solution for Coding of Adverse Events in Clinical Trials. Applied Clinical Trials. https://www.appliedclinicaltrialsonline.com/view/real-world-use-of-an-artificial-intelligence-supported-solution-for-coding-of-adverse-events-in-clinical-trials (2022)

[21] Yinhan Liu, M.O., et al.: RoBERTa: A robustly optimized BERT pretraining
approach. Preprint at https://arxiv.org/abs/1907.11692 (2019)

[22] Feng, Y.: Semantic textual similarity analysis of clinical text in the era of llm.
In: 2024 IEEE Conference on Artificial Intelligence (CAI), pp. 1284–1289 (2024).
https://doi.org/10.1109/CAI59869.2024.00227

[23] Jianmo Ni, G.H.A., et al.: Sentence-T5: Scalable sentence encoders from pre-
trained text-to-text models. In: Findings of the Association for Computational
Linguistics: ACL 2022, pp. 1864–1874. Association for Computational Linguistics,
Dublin, Ireland (2022). https://doi.org/10.18653/v1/2022.findings-acl.146

[24] Nascimento Teixeira, B., Leitão, et al.: Can chatgpt support clinical coding
using the icd-10-cm/pcs? Informatics 11(4):84 (2024) https://doi.org/10.3390/
informatics11040084

[25] Ashley Simmons, K.T., et al.: Extracting international classification of diseases codes from clinical documentation using large language models. Appl Clin Inform (2024) https://doi.org/10.1055/a-2491-3872

[26] Matúš Falis, A.P.G., et al.: Can gpt-3.5 generate and code discharge summaries?
Journal of the American Medical Informatics Association 31(10):2284–2293
(2024) https://doi.org/10.1093/jamia/ocae132

[27] Alistair EW Johnson, L.B., et al.: Mimic-iv, a freely accessible electronic health
record dataset. Sci Data 10 (2023) https://doi.org/10.1038/s41597-022-01899-x

[28] Hanyin Wang, C.G., et al.: DRG-LLaMA: tuning llama model to predict
diagnosis-related group for hospitalized patients. NPJ Digit Med 7:16 (2024)
https://doi.org/10.1038/s41746-023-00989-3

[29] Pengfei Liu, W.Y., et al.: A systematic survey of prompting methods in natural
language processing. ACM Computing Surveys 55(9):1–15 (2023) https://doi.
org/10.1145/3560815

[30] Rishabh Agarwal, A.S., et al.: Many-shot In-Context Learning. Preprint at https://arxiv.org/abs/2404.11018 (2024)

[31] Gao Tianyu, A. Fisch, et al.: Making pre-trained language models better few-
shot learners. In: Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Nat-
ural Language Processing (Volume 1: Long Papers), pp. 3816–3830. Association
for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.
acl-long.295

[32] Chrisantha Fernando, D.B., et al.: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution. Preprint at https://arxiv.org/abs/2309.16797 (2023)

[33] Hou Yutai, D.H., et al.: MetaPrompting: Learning to learn better prompts. In:
Proceedings of the 29th International Conference on Computational Linguistics,
pp. 3251–3262 (2022). https://doi.org/10.48550/arXiv.2209.11486

[34] Microsoft: Introducing GPT-4o: OpenAI's new flagship multimodal model now in preview on Azure. 13 May 2024. https://azure.microsoft.com/en-us/blog/introducing-gpt-4o-openais-new-flagship-multimodal-model-now-in-preview-on-azure/?msockid=216c337baed0632434f3262aaf8a6292. Accessed 15 Dec 2024

[35] Ningyu Zhang, M.C., et al.: CBLUE: A chinese biomedical language under-
standing evaluation benchmark. In: Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
7888–7915. Association for Computational Linguistics, Dublin, Ireland (2022).
https://doi.org/10.18653/v1/2022.acl-long.544

[36] Centers for Medicare & Medicaid Services: ICD-10-CM/PCS MS-DRG V34.0 Definitions Manual. https://www.cms.gov/ICD10M/version34-fullcode-cms/fullcode_cms/P0001.html. Accessed 15 Dec 2024

[37] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence Embeddings Using Siamese
BERT-Networks. Preprint at https://arxiv.org/abs/1908.10084 (2019)

[38] Xu, M.: text2vec: A Tool for Text to Vector. https://github.com/shibing624/text2vec (2022)

[39] PhysioNet: Responsible Use of MIMIC Data With Online Services Like GPT.
https://physionet.org/news/post/gpt-responsible-use. Accessed 15 Dec 2024
(2023)
