Can Large Language Models Replace Coding Specialists?
Research Article
Keywords: many-shot learning, in-context learning, meta prompt, GPT-4o, medical coding, clinical trials,
diagnostic-related groups
DOI: https://doi.org/10.21203/rs.3.rs-5750190/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Purpose: Large Language Models (LLMs), GPT in particular, have demonstrated
near human-level performance in the medical domain, from summarizing clinical
notes and passing medical licensing examinations to predictive tasks such as
disease diagnosis and treatment recommendation. However, there is currently
little research on their efficacy for medical coding, a pivotal component of health
informatics, clinical trials, and reimbursement management. This study proposes
a prompt framework and investigates its effectiveness in medical coding tasks.
Methods: First, a medical coding prompt framework is proposed. This framework
aims to improve the performance of complex coding tasks by leveraging
state-of-the-art (SOTA) prompt techniques, including meta prompt, multi-shot
learning, and dynamic in-context learning, to extract task-specific knowledge. The
framework is implemented with a combination of the commercial GPT-4o and open-
source LLMs. Its effectiveness is then evaluated on three different coding tasks.
Finally, ablation studies are presented to validate and analyze the contribution
of each module in the proposed prompt framework.
Results: On the MIMIC-IV dataset, the prediction accuracy is 68.1% over the
30 most frequent MS-DRG codes. To the best of our knowledge, this is comparable
to the SOTA of 69.4%, achieved by fine-tuning the open-source LLaMA model, and
the top-5 accuracy is 90.0%. The clinical trial criteria coding task yields a macro
F1 score of 68.4 on the Chinese CHIP-CTC test dataset, close to the 70.9 achieved
by the best supervised model training method in the comparison. For the less complex
semantic coding task, our method yields a macro F1 score of 79.7 on the Chinese
CHIP-STS test dataset, which is not competitive with most of the supervised model
training methods in the comparison.
Conclusion: This study demonstrates that for complex medical coding tasks,
carefully designed prompt-based learning can achieve performance similar to
SOTA supervised model training approaches. Currently, LLMs can serve as very
helpful assistants, but they do not replace human coding specialists. With the rapid
advancement of LLMs, their potential to reliably automate complex medical
coding in the near future should not be underestimated.
1 Introduction
Since OpenAI released ChatGPT two years ago [1], the world's leading artificial
intelligence (AI) research powerhouses have been relentlessly pushing the SOTA in
large language models (LLMs). For example, Google introduced Gemini [2], Meta
released the open-source LLaMA [3], and many more followed. Riding this mega wave,
applied AI research based on commercial, proprietary, or open-source LLMs has
flourished in many application domains, from biology, medicine, education, and
software engineering to content creation and customer service.
In the medical and healthcare domain, the research literature has shown ChatGPT
to be very effective in comprehending English as well as non-English medical docu-
ments. GPT-4 achieved scores above the passing level in Korean Pharmacist Licensing
Examinations [4]. Subsequently, the enhanced version GPT-4o significantly outper-
formed average medical students in the United States Medical Licensing Examination
Luo et al. proposed BrainGPT [6] and showed that enhanced LLMs exceeded human
neuroscience experts in behavioral prediction tasks that require neuroscience
knowledge. In a systematic review of ChatGPT in health care applications [7], Wang et al.
concluded that conversational LLMs perform well in summarizing health-related texts
and answering general medical knowledge questions.
model proposed by Li et al. [14]. Using a Transformer-based pre-trained language
model BioBERT [15] as the backbone, Huang et al. proposed PLM-ICD to tackle the
challenges of the task at hand, including long input text and a large prediction label
space [16]. In macro F1 score, the prediction performance on the 50 most frequent ICD
codes in the MIMIC-III dataset reached 0.603 and 0.615 with SWAM and PLM-ICD,
respectively, and was further improved to 0.665 by JLAN.
Another frequently studied task is predicting Diagnostic-Related Groups (DRG),
a coding system that classifies hospital cases into groups based on the patient's
diagnosis, treatment, and risk of mortality. Liu et al. trained a long short-term
memory (LSTM) model over discharge summaries in the MIMIC-III dataset, which
resulted in a macro F1 score of 0.041 when predicting all DRG codes [17]. The
DRGCoder proposed in [18] leveraged the domain-specific pre-trained Transformer
ClinicalBERT [19] for the same task and reported a macro F1 score of 0.101 over
all DRG codes.
The vanilla BERT model and many of its variants have also been adopted as a
backbone network for classifying clinical trial eligibility criteria, where deep expertise
is often required for coding. In [20], RoBERTa-large [21] achieved a macro F1 score
of 0.709 to predict an institutional coding schema of 44 categories, where real clinical
trial registration data collected from the Chinese Clinical Trial Registry were used.
In [22], Feng et al. exploited the sentence-T5 [23] as a semantic feature generator to
encode clinical text.
1.3 Techniques Improving LLM Performance in the General Domain
As the context window size of LLMs increases rapidly, for example from 2048
tokens in GPT-3.5 to 8192 tokens in GPT-4, complex tasks can benefit from scaling
up the number of in-context examples in a prompt. With Gemini 1.5 Pro supporting
context lengths of up to 1 million tokens, Agarwal et al. investigated the impact of
the number of examples on 11 different types of text generation and prediction tasks
[30]. The authors introduced a many-shot in-context regime and reported that large
performance jumps occurred when prompting with several hundred to several thousand
shots, especially on complex reasoning tasks.
In prompt-based learning, the design of prompts can greatly affect task per-
formance. Many prompt engineering methods have been proposed to overcome the
suboptimal nature of hand-crafted task prompts. In [31], Gao et al. proposed a pipeline
that automates prompt template generation and optimization. The authors reported
that their methods outperformed the manual prompt by 11% on average in text
classification tasks. Promptbreeder [32] introduced an optimization workflow that iter-
atively mutates task prompts and evaluates their fitness on a training dataset. Hou
et al. argued that the starting point of the search for optimal task prompts matters.
The authors proposed MetaPrompting [33], a task-agnostic framework that learns general
meta-knowledge from specific task domains to provide a better initial task prompt.
2 Methods
Every medical code schema, be it an international or an institutional standard, is a
taxonomy system that comprises definitions of each category in the system and guidelines
on how to assign categories to a given text note. In the paradigm of training a
model for a specific task [14-22], the quality and amount of annotated text notes used
to train the prediction model greatly impact task performance, while task-specific
knowledge such as the taxonomy itself is not exploited. Inspired by the latest
developments in prompt techniques [32-34], this paper proposes a prompt framework
specifically designed for medical coding tasks. The rest of this section first presents
the proposal and then discusses its application to three coding tasks and the
corresponding prompt design.
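As a concrete illustration of how such a task prompt could be assembled from the components sketched in Figure 1, the following is a minimal, hypothetical sketch in Python; the function name, prompt wording, and section layout are assumptions for illustration, not the exact prompts used in this study.

def build_task_prompt(test_text, guidelines, examples, schema):
    """Assemble a task prompt (sketch) from the components in Figure 1.

    guidelines : classification guidelines extracted earlier by the meta prompt
    examples   : (text, code) pairs retrieved dynamically from the training set
    schema     : the allowed code categories of the coding task
    """
    parts = ["You are a medical coding assistant."]
    parts.append("Allowed codes:\n" + "\n".join(f"- {c}" for c in schema))
    parts.append("Classification guidelines:\n" + "\n".join(f"* {g}" for g in guidelines))
    for text, code in examples:                 # many-shot, dynamic in-context examples
        parts.append(f"Text: {text}\nCode: {code}")
    parts.append(f"Text: {test_text}\nCode:")   # the sample to be coded
    return "\n\n".join(parts)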
Fig. 1 Framework of Medical Coding Task Prompt
2.2 Meta Prompt Design
2.2.1 Task 1: Semantic Coding
This task is to categorize a pair of disease-related questions as similar or different.
Similar means that both questions ask about the same disease-related issue. The question
pairs are in Chinese; see Figure 2 for examples of the original questions in Chinese and
their English translations. CHIP-STS is a dataset from the Chinese Biomedical
Language Understanding Evaluation (CBLUE) benchmark [35].
Applying a meta prompt that includes an instruction to translate all original questions
from Chinese to English, a total of 20 classification guidelines are learned from
the 16,000 training samples in the CHIP-STS dataset. See a few guideline examples in
Figure 2.
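As an illustration only, the following sketch shows one way such a guideline-learning meta prompt could be issued through the Azure OpenAI Python SDK; the prompt wording, the learn_guidelines helper, and the batching (in practice the 16,000 samples would be processed in smaller batches) are assumptions rather than the exact implementation.

from openai import AzureOpenAI  # endpoint, key, and api_version read from environment variables

client = AzureOpenAI()

META_PROMPT = (
    "Translate the following Chinese question pairs into English. Then summarize up to "
    "{k} general classification guidelines that explain why each pair is labeled "
    "'similar' or 'different'.\n\n{batch}"
)

def learn_guidelines(train_pairs, k=20, deployment="gpt-4o"):
    """Run the meta prompt over a batch of labeled training pairs (sketch)."""
    batch = "\n".join(f"Q1: {q1} | Q2: {q2} | label: {y}" for q1, q2, y in train_pairs)
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": META_PROMPT.format(k=k, batch=batch)}],
    )
    return response.choices[0].message.content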
Fig. 3 CHIP-CTC Dataset Sample Frequencies by Code Categories
set for a meta prompt to learn a classification guideline, giving a total of 44
guidelines. See a couple of examples in Figure 4.
was discharged alive or not, with or without major complications and comorbid con-
ditions. We follow the dataset preparation schema in DRG-LLaMA [28] to randomly
divide the MIMIC-IV dataset into 90% for training and 10% for evaluation.
In this study, we evaluate the 30 most frequent DRGs. DRG is an international
standard in which each DRG code has a definition that serves as a manual for code
mapping. These definitions are short phrases given in [36]; see a couple of them in
Figure 5. Instead of learning classification guidelines from the task training set,
the meta prompt for this task translates each DRG definition into a more
comprehensible, cohesive short paragraph, as shown in Figure 5.
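A correspondingly simple, hypothetical sketch of this definition-expansion step could look as follows; the prompt wording and the expand_drg_definition helper are assumptions for illustration, not the exact meta prompt used.

from openai import AzureOpenAI

client = AzureOpenAI()  # configured as in the earlier sketch

DEFINITION_PROMPT = (
    "Rewrite the following MS-DRG definition as a short, cohesive paragraph that can be "
    "applied to a discharge summary. Keep every clinical criterion.\n\nDRG {code}: {definition}"
)

def expand_drg_definition(code, definition, deployment="gpt-4o"):
    """Turn a terse MS-DRG definition phrase into a more comprehensible paragraph (sketch)."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user",
                   "content": DEFINITION_PROMPT.format(code=code, definition=definition)}],
    )
    return response.choices[0].message.content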
find the N most similar examples from the corresponding training set. The discharge
summaries in task 3 are in English and are, on average, much longer, so the
sentence-T5-large model [23] is chosen to embed the text.
N refers to the number of many-shot learning examples, as shown in the task
prompt box in Figure 1. In Section 3, the optimal value of N for each task and other
aspects of the evaluation setup are discussed.
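As a sketch of this dynamic example retrieval, assuming the sentence-transformers library, placeholder training data, and cosine similarity as the retrieval criterion, the N most similar shots for a test sample could be obtained as follows.

from sentence_transformers import SentenceTransformer, util

# Placeholder training data; in the real setup these are the task's training samples.
train_texts = ["discharge summary of patient A ...", "discharge summary of patient B ..."]
train_codes = ["MS-DRG 871", "MS-DRG 470"]

# sentence-T5-large for the long English discharge summaries of task 3.
model = SentenceTransformer("sentence-transformers/sentence-t5-large")
train_emb = model.encode(train_texts, convert_to_tensor=True, normalize_embeddings=True)

def retrieve_examples(test_text, n=20):
    """Return the N training examples most similar to the test sample."""
    query = model.encode(test_text, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(query, train_emb)[0]          # cosine similarity to every training shot
    top = scores.topk(min(n, len(train_texts))).indices.tolist()
    return [(train_texts[i], train_codes[i]) for i in top]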
3 Results
3.1 Technical Setup
To evaluate the performance of the prompt framework on these coding tasks, the
GPT-4o 2024-08-01 preview version of the Azure OpenAI service was used. The Azure
OpenAI service is one of the three online GPT-like services recommended for
responsible use of the MIMIC dataset [39].
Following related work, the macro F1 and accuracy metrics are used for performance
assessment in this study. A macro F1 score is the average of the F1 scores of each
unique code and therefore reflects effectiveness on the minority classes in tasks 2
and 3, where the data are extremely imbalanced. In contrast, a micro F1 score,
averaged over an entire test dataset, reports performance that is potentially skewed
towards the majority classes. In multiclass classification tasks, micro F1 is
equivalent to accuracy.
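The difference between the two metrics can be seen in a small worked example (a toy illustration with made-up labels, not data from this study) using scikit-learn's f1_score.

from sklearn.metrics import f1_score

# Toy 3-class example: class 0 dominates, class 2 is never predicted.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1]

print(f1_score(y_true, y_pred, average="macro"))  # ~0.452: dragged down by the missed minority class
print(f1_score(y_true, y_pred, average="micro"))  # ~0.667: equals accuracy (4 of 6 correct)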
As shown in Figure 6, the performance of both tasks increases significantly as the
number of shots increases. In task 1, the best macro F1 score reaches 0.7909 with
40 shots, and the best result in task 2 is 0.8274 with 20 shots. As the many-shot
size continues to grow, task 2 performance degrades quickly and task 1 performance
also drops gradually.
The STS and CTC test sets have 10,000 and 10,193 samples, respectively. The
performance of tasks 1 and 2 is evaluated on the CBLUE benchmark platform with the
complete test sets. In Table 1, the results of our method are compared with the results
of the leading methods reported in [35], where BERT networks with various
improvements were trained on Chinese corpora. Task 1 predicts whether a given question
pair is semantically similar or not; existing supervised model training approaches
outperform our prompt-based learning method on this task. Compared to task 1, task 2
is more complex, as a given sentence is classified into 1 of 44 categories. On this
task, our prompt-based learning method performs on par with existing supervised model
training approaches and even outperforms ALBERT-xxlarge.
Fig. 7 MIMIC-IV Sample Frequencies by MS-DRGs and Prediction Performance
the corresponding sample frequency. The bottom-performing code predictions cluster
in the below-1% frequency region.
4 Discussion
The section above reports the overall performance of the proposed prompt framework
applied to three different medical coding tasks, from the relatively simple semantic
coding task to the more complex clinical trial criteria and MS-DRG coding tasks. We
also investigated whether the smaller GPT-4o mini performs on par or better and found
that GPT-4o gives better results. For example, on the CTC dev set (500 samples),
GPT-4o mini scores 0.033 lower than GPT-4o in macro F1.
This section first presents ablation studies that validate the effectiveness of each
module in the proposed prompt framework, including the hypothesis that the more
relevant the learning examples are to a test sample, the more helpful they are to the
LLM in performing the task. Potential limitations are then discussed before concluding
the study.
4.1 Efficacy of Task Guidelines
To find out whether the classification guidelines learned by the meta prompts help
improve prediction performance, we remove the guidelines from the task prompts and
rerun the experiments, but on smaller evaluation sets, namely the ablation study sets.
Task 3 benefits greatly from the task guidelines. Without the meta-prompt-learned
classification guidelines, a drop of as much as 0.343 in macro F1 score is observed
over an ablation set of 100 samples from the task test set. The guidelines also help
the performance of tasks 2 and 1: without them, the macro F1 score drops by 0.056 and
0.030, respectively, over the ablation sets, namely the CTC dev set (500) and the STS
dev set (1000), as labeled in Figure 6. The results indicate that the more complex the
task, the more helpful the guidelines.
4.3 Limitations
In tasks 1 and 2, once the multi-shot size passes a certain point, as shown in Figure 6,
task performance is penalized as the multi-shot size continues to increase. In task 3,
because the input texts are much longer, GPT-4o supports only up to 80 shots in the
task prompt used. An ablation study with a multi-shot size of 30, however, results in
a 0.035 decrease in macro F1 score over its ablation set. This suggests that DRG code
prediction performance could benefit from an LLM that supports a larger context window.
For projects that consume tokens at large scale, the cost and high latency of online
LLM services can be a limitation.
4.4 Conclusions
This study proposes a language-agnostic prompt framework for predicting medical codes
with LLMs. The framework exploits the latest techniques developed in the prompt-based
learning field, including meta prompt, multi-shot, and dynamic in-context learning, to
improve performance on complex tasks. The framework implementation combines the
commercial Azure OpenAI GPT-4o service with small open-source LLMs. Its effectiveness
is then evaluated on different tasks in the context of institutional and standard
coding schemas. Ablation studies show that the key proposals, extracting task-specific
knowledge into classification guidelines and multi-shot dynamic in-context learning,
are effective and drastically lift the performance of complex tasks. Compared to
related works that take a supervised model training approach, our prompt-based
learning framework gives comparable performance on two complex coding tasks but
underperforms on the relatively simple semantic coding task. With the rapid
advancement of LLMs, their context windows will continue to grow, and the proposed
prompt framework has the potential to further enhance DRG code prediction performance.
Acknowledgements
This research is supported by Amplify Health Asia.
Funding Declaration
There was no funding.
Declarations
The author has no relevant financial or non-financial interests to disclose.
References
[1] OpenAI: ChatGPT-Release Notes. 30 Nov 2022. https://help.openai.com/en/
articles/6825453-chatgpt-release-notes. Accessed 15 Dec 2024
[2] Google: Introducing Gemini: our largest and most capable AI model. 06 Dec 2023.
https://blog.google/technology/ai/google-gemini-ai/. Accessed 15 Dec 2024
[4] Kyung, J.H., Kim, E.: Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist
Licensing Examination: comparison study. JMIR Medical Education 10:e57451
(2024) https://doi.org/10.2196/57451
[5] Bicknell, B.T., et al.: ChatGPT-4 Omni performance in USMLE disciplines and
clinical skills: comparative analysis. JMIR Medical Education 10:e63430 (2024)
https://doi.org/10.2196/63430
[6] Xiaoliang Luo, A.R., et al.: Large language models surpass human experts in
predicting neuroscience results. Nat Hum Behav (2024) https://doi.org/10.1038/
s41562-024-02046-9
[7] Leyao Wang, Z.W., et al.: Conversational large language models in health care:
Systematic review. J Med Internet Res 26:e22769 (2024) https://doi.org/10.
2196/22769
[8] Shepheard, J.: Clinical coding and the quality and integrity of health data. Health
Inf Manag 49:3-4 (2020) https://doi.org/10.1177/1833358319874008
[9] Drabiak, K., Wolfson, J.: What should health care organizations do to reduce
billing fraud and abuse? AMA J Ethics 22(3):221-231 (2020) https://doi.org/
10.1001/amajethics.2020.221
[10] Babre, D.: Medical coding in clinical trials. Perspect Clin Res 1(1):29-32 (2010)
[11] Shaoxiong Ji, X.L., et al.: A unified review of deep learning for automated medical
coding. ACM Computing Surveys 55(12):1-41 (2024) https://doi.org/10.1145/
3664615
[12] Alistair E.W. Johnson, T.J.P., et al.: MIMIC-III, a freely accessible critical care
database. Sci Data 3:160035 (2016) https://doi.org/10.1038/sdata.2016.35
[13] Shuyuan Hu, F.T., et al.: An explainable CNN approach for medical codes
prediction from clinical text. BMC Med Inform Decis Mak 21:256 (2021)
https://doi.org/10.1186/s12911-021-01615-6
[14] Xingwang Li, Y.Z., et al.: JLAN: medical code prediction via joint learning atten-
tion networks and denoising mechanism. BMC Bioinformatics 21(1):590 (2021)
https://doi.org/10.1186/s12859-021-04520-x
[15] Jinhyuk Lee, W.Y., et al.: BioBERT: a pre-trained biomedical language representation
model for biomedical text mining. Bioinformatics 36(4):1234-1240 (2020)
https://doi.org/10.1093/bioinformatics/btz682
[16] Chao-Wei Huang, S.-C.T., et al.: PLM-ICD: Automatic ICD coding with pre-trained
language models. In: Proceedings of the 4th Clinical Natural Language
Processing Workshop, pp. 10–20. Association for Computational Linguistics,
Seattle, WA (2022). https://doi.org/10.18653/v1/2022.clinicalnlp-1.2
[17] Jinghui Liu, D.C., et al.: Early prediction of diagnostic-related groups and esti-
mation of hospital cost by processing clinical notes. npj Digit. Med 4:103 (2021)
https://doi.org/10.1038/s41746-021-00474-9
[18] Daniel Hajialigol, D.K., et al.: DRGCODER: Explainable clinical coding for the
early prediction of diagnostic-related groups. In: Proceedings of the 2023 Confer-
ence on Empirical Methods in Natural Language Processing: System Demonstra-
tions, pp. 373–380. Association for Computational Linguistics, Singapore (2023).
https://doi.org/10.18653/v1/2023.emnlp-demo.34
[19] Emily Alsentzer, J.M., et al.: Publicly available clinical BERT embeddings. In:
Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78.
Association for Computational Linguistics, Minneapolis, Minnesota, USA (2019).
https://doi.org/10.18653/v1/W19-1909
[21] Yinhan Liu, M.O., et al.: RoBERTa: A robustly optimized BERT pretraining
approach. Preprint at https://arxiv.org/abs/1907.11692 (2019)
[22] Feng, Y.: Semantic textual similarity analysis of clinical text in the era of LLM.
In: 2024 IEEE Conference on Artificial Intelligence (CAI), pp. 1284–1289 (2024).
https://doi.org/10.1109/CAI59869.2024.00227
[23] Jianmo Ni, G.H.A., et al.: Sentence-T5: Scalable sentence encoders from pre-
trained text-to-text models. In: Findings of the Association for Computational
Linguistics: ACL 2022, pp. 1864–1874. Association for Computational Linguistics,
Dublin, Ireland (2022). https://doi.org/10.18653/v1/2022.findings-acl.146
[24] Nascimento Teixeira, B., Leitão, et al.: Can ChatGPT support clinical coding
using the ICD-10-CM/PCS? Informatics 11(4):84 (2024) https://doi.org/10.3390/
informatics11040084
[26] Matúš Falis, A.P.G., et al.: Can GPT-3.5 generate and code discharge summaries?
Journal of the American Medical Informatics Association 31(10):2284–2293
(2024) https://doi.org/10.1093/jamia/ocae132
[27] Alistair E.W. Johnson, L.B., et al.: MIMIC-IV, a freely accessible electronic health
record dataset. Sci Data 10 (2023) https://doi.org/10.1038/s41597-022-01899-x
[28] Hanyin Wang, C.G., et al.: DRG-LLaMA: tuning LLaMA model to predict
diagnosis-related group for hospitalized patients. NPJ Digit Med 7:16 (2024)
https://doi.org/10.1038/s41746-023-00989-3
[29] Pengfei Liu, W.Y., et al.: A systematic survey of prompting methods in natural
language processing. ACM Computing Surveys 55(9):1–15 (2023) https://doi.
org/10.1145/3560815
[30] Rishabh Agarwal, A.S., et al.: Many-shot In-Context Learning. Preprint at
https://arxiv.org/abs/2404.11018 (2024)
[31] Gao Tianyu, A. Fisch, et al.: Making pre-trained language models better few-
shot learners. In: Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Nat-
ural Language Processing (Volume 1: Long Papers), pp. 3816–3830. Association
for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.
acl-long.295
[33] Hou Yutai, D.H., et al.: MetaPrompting: Learning to learn better prompts. In:
Proceedings of the 29th International Conference on Computational Linguistics,
pp. 3251–3262 (2022). https://doi.org/10.48550/arXiv.2209.11486
[35] Ningyu Zhang, M.C., et al.: CBLUE: A Chinese biomedical language understanding
evaluation benchmark. In: Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
7888–7915. Association for Computational Linguistics, Dublin, Ireland (2022).
https://doi.org/10.18653/v1/2022.acl-long.544
[36] Centers for Medicare & Medicaid Services: ICD-10-CM/PCS MS-DRG V34.0 Definitions
Manual. CMS https://www.cms.gov/ICD10M/version34-fullcode-cms/fullcode_cms/
P0001.html. Accessed 15 Dec 2024
[37] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence Embeddings Using Siamese
BERT-Networks. Preprint at https://arxiv.org/abs/1908.10084 (2019)
[39] PhysioNet: Responsible Use of MIMIC Data With Online Services Like GPT.
https://physionet.org/news/post/gpt-responsible-use. Accessed 15 Dec 2024
(2023)