Can Large Language Models Replace Coding Specialists?
Research Article
Keywords: many-shot learning, in-context learning, meta prompt, GPT-4o, medical coding, clinical trials,
diagnostic-related groups
DOI: https://doi.org/10.21203/rs.3.rs-5750190/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Purpose: Large Language Models (LLMs), GPT in particular, have demonstrated
near human-level performance in the medical domain, from summarizing clinical
notes and passing medical licensing examinations to predictive tasks such as
disease diagnosis and treatment recommendation. However, there is currently
little research on their efficacy for medical coding, a pivotal component of health
informatics, clinical trials, and reimbursement management. This study proposes
a prompt framework and investigates its effectiveness in medical coding tasks.
Methods: First, a medical coding prompt framework is proposed. This framework
aims to improve the performance of complex coding tasks by leveraging
state-of-the-art (SOTA) prompt techniques, including meta prompt, multi-shot
learning, and dynamic in-context learning, to extract task-specific knowledge. The
framework is implemented with a combination of the commercial GPT-4o and open-
source LLMs. Its effectiveness is then evaluated on three different coding tasks.
Finally, ablation studies are presented to validate and analyze the contribution
of each module in the proposed prompt framework.
Results: On the MIMIC-IV dataset, the prediction accuracy is 68.1% over the
30 most frequent MS-DRG codes. To the best of our knowledge, this is comparable
to the SOTA of 69.4%, achieved by fine-tuning the open-source LLaMA model, and
the top-5 accuracy is 90.0%. The clinical trial criteria coding task yields a macro
F1 score of 68.4 on the Chinese CHIP-CTC test dataset, close to the 70.9 achieved
by the best supervised model training method in the comparison. For the less complex
semantic coding task, our method yields a macro F1 score of 79.7 on the Chinese
CHIP-STS test dataset, which is not competitive with most of the supervised model
training methods in the comparison.
Conclusion: This study demonstrates that for complex medical coding tasks,
carefully designed prompt-based learning can achieve performance similar to
SOTA supervised model training approaches. Currently, LLMs can serve as very
helpful assistants, but they do not replace human coding specialists. With the rapid
advancement of LLMs, their potential to reliably automate complex medical
coding in the near future should not be underestimated.
1 Introduction
Since OpenAI released ChatGPT two years ago [1], the world's leading artificial
intelligence (AI) research powerhouses have been relentlessly pushing the SOTA in
large language models (LLMs). For example, Google introduced Gemini [2], Meta
released the open-source LLaMA [3], and many more followed. Riding this mega wave,
applied AI research based on commercial, proprietary, or open-source LLMs has
flourished in many application domains, from biology, medicine, education, and
software engineering to content creation and customer service.
In the medical and healthcare domain, the research literature has shown ChatGPT
to be very effective in comprehending English as well as non-English medical docu-
ments. GPT-4 achieved scores above the passing level in Korean Pharmacist Licensing
Examinations [4]. Subsequently, the enhanced version GPT-4o significantly outper-
formed average medical students in the United States Medical Licensing Examination
Luo et al. proposed BrainGPT [6] and showed that enhanced LLMs exceeded human
neuroscience experts in behavioral prediction tasks that require neuroscience
knowledge. In a systematic review of ChatGPT in health care applications [7], Wang et al.
concluded that conversational LLMs perform well in summarizing health-related texts
and answering general medical knowledge questions.
model proposed by Li et al. [14]. Using a Transformer-based pre-trained language
model BioBERT [15] as the backbone, Huang et al. proposed PLM-ICD to tackle the
challenges of the task at hand, including long input text and a large prediction label
space [16]. In macro F1 score, the prediction performance on the 50 most frequent ICD
codes in the MIMIC-III dataset reached 0.603 and 0.615 with SWAM and PLM-ICD,
respectively, and was further improved to 0.665 by JLAN.
Another frequently studied task is predicting Diagnostic-Related Groups (DRG),
a coding system that classifies hospital cases into groups based on the patient's
diagnosis, treatment, and risk of mortality. Liu et al. trained a long short-term
memory (LSTM) model over discharge summaries in the MIMIC-III dataset, which
resulted in a macro F1 score of 0.041 when predicting all DRG codes [17]. The
DRGCoder proposed in [18] leveraged the domain-specific pre-trained Transformer
ClinicalBERT [19] for the same task and reported a macro F1 score of 0.101 over
all DRG codes.
The vanilla BERT model and many of its variants have also been adopted as a
backbone network for classifying clinical trial eligibility criteria, where deep expertise
is often required for coding. In [20], RoBERTa-large [21] achieved a macro F1 score
of 0.709 to predict an institutional coding schema of 44 categories, where real clinical
trial registration data collected from the Chinese Clinical Trial Registry were used.
In [22], Feng et al. exploited the sentence-T5 [23] as a semantic feature generator to
encode clinical text.
1.3 Techniques Improving LLM Performance in the General Domain
As the context window size of LLMs increases rapidly, for example from 2048
tokens in GPT-3.5 to 8192 tokens in GPT-4, complex tasks can benefit from scaling
up the number of in-context examples in a prompt. With Gemini 1.5 Pro supporting
context lengths of up to 1 million tokens, Agarwal et al. investigated the impact of
the number of examples on 11 different types of text generation and prediction tasks
[30]. The authors introduced a many-shot in-context regime and reported that large
performance jumps occurred when prompting with several hundred to several thousand
shots, especially on complex reasoning tasks.
In prompt-based learning, the design of prompts can greatly affect task per-
formance. Many prompt engineering methods have been proposed to overcome the
suboptimal nature of hand-crafted task prompts. In [31], Gao et al. proposed a pipeline
that automates prompt template generation and optimization. The authors reported
that their methods outperformed the manual prompt by 11% on average in text
classification tasks. Promptbreeder [32] introduced an optimization workflow that iter-
atively mutates task prompts and evaluates their fitness on a training dataset. Hou
et al. argued that the starting point of the search for optimal task prompts matters.
The authors proposed MetaPrompting [33], a task-agnostic framework that learns general
meta-knowledge from specific task domains to provide a better initial task prompt.
2 Methods
Every medical code schema, be it an international or an institutional standard, is a
taxonomy system that comprises definitions of each category in the system and guidelines
on how to assign categories to a given text note. In the paradigm of training a
model for a specific task [14-22], the quality and amount of annotated text notes used
to train the prediction model greatly impact task performance, while task-specific
knowledge such as the taxonomy itself is not exploited. Inspired by the latest
developments in prompt techniques [32-34], this paper proposes a prompt framework
specifically designed for medical coding tasks. The rest of this section first presents
the proposal and then discusses its application to three coding tasks and the
corresponding prompt design.
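As a concrete illustration of how such a task prompt could be assembled from the components sketched in Figure 1, the following is a minimal, hypothetical sketch in Python; the function name, prompt wording, and section layout are assumptions for illustration, not the exact prompts used in this study.

def build_task_prompt(test_text, guidelines, examples, schema):
    """Assemble a task prompt (sketch) from the components in Figure 1.

    guidelines : classification guidelines extracted earlier by the meta prompt
    examples   : (text, code) pairs retrieved dynamically from the training set
    schema     : the allowed code categories of the coding task
    """
    parts = ["You are a medical coding assistant."]
    parts.append("Allowed codes:\n" + "\n".join(f"- {c}" for c in schema))
    parts.append("Classification guidelines:\n" + "\n".join(f"* {g}" for g in guidelines))
    for text, code in examples:                 # many-shot, dynamic in-context examples
        parts.append(f"Text: {text}\nCode: {code}")
    parts.append(f"Text: {test_text}\nCode:")   # the sample to be coded
    return "\n\n".join(parts)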
Fig. 1 Framework of Medical Coding Task Prompt
2.2 Meta Prompt Design
2.2.1 Task 1: Semantic Coding
This task is to categorize a pair of disease-related questions as similar or different.
Similar means that both questions ask about the same disease-related issue. The question
pairs are in Chinese; see Figure 2 for examples of the original questions in Chinese and
their English translations. CHIP-STS is a dataset from the Chinese Biomedical
Language Understanding Evaluation (CBLUE) benchmark [35].
Applying a meta prompt that includes an instruction to translate all original questions
from Chinese to English, a total of 20 classification guidelines are learned from
the 16,000 training samples in the CHIP-STS dataset. See a few guideline examples in
Figure 2.
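As an illustration only, the following sketch shows one way such a guideline-learning meta prompt could be issued through the Azure OpenAI Python SDK; the prompt wording, the learn_guidelines helper, and the batching (in practice the 16,000 samples would be processed in smaller batches) are assumptions rather than the exact implementation.

from openai import AzureOpenAI  # endpoint, key, and api_version read from environment variables

client = AzureOpenAI()

META_PROMPT = (
    "Translate the following Chinese question pairs into English. Then summarize up to "
    "{k} general classification guidelines that explain why each pair is labeled "
    "'similar' or 'different'.\n\n{batch}"
)

def learn_guidelines(train_pairs, k=20, deployment="gpt-4o"):
    """Run the meta prompt over a batch of labeled training pairs (sketch)."""
    batch = "\n".join(f"Q1: {q1} | Q2: {q2} | label: {y}" for q1, q2, y in train_pairs)
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": META_PROMPT.format(k=k, batch=batch)}],
    )
    return response.choices[0].message.content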
Fig. 3 CHIP-CTC Dataset Sample Frequencies by Code Categories
set for a meta prompt to learn a classification guideline, giving a total of 44
guidelines. See a couple of examples in Figure 4.
was discharged alive or not, with or without major complications and comorbid con-
ditions. We follow the dataset preparation schema in DRG-LLaMA [28] to randomly
divide the MIMIC-IV dataset into 90% for training and 10% for evaluation.
In this study, we evaluate the 30 most frequent DRGs. DRG is an international
standard in which each DRG code has a definition that serves as a manual for code
mapping. These definitions are short phrases given in [36]; see a couple of them in
Figure 5. Instead of learning classification guidelines from the task training set,
the meta prompt for this task translates each DRG definition into a more
comprehensible, cohesive short paragraph, as shown in Figure 5.
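A correspondingly simple, hypothetical sketch of this definition-expansion step could look as follows; the prompt wording and the expand_drg_definition helper are assumptions for illustration, not the exact meta prompt used.

from openai import AzureOpenAI

client = AzureOpenAI()  # configured as in the earlier sketch

DEFINITION_PROMPT = (
    "Rewrite the following MS-DRG definition as a short, cohesive paragraph that can be "
    "applied to a discharge summary. Keep every clinical criterion.\n\nDRG {code}: {definition}"
)

def expand_drg_definition(code, definition, deployment="gpt-4o"):
    """Turn a terse MS-DRG definition phrase into a more comprehensible paragraph (sketch)."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user",
                   "content": DEFINITION_PROMPT.format(code=code, definition=definition)}],
    )
    return response.choices[0].message.content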
find the N most similar examples from the corresponding training set. The discharge
summaries in task 3 are in English and are, on average, much longer, so the
sentence-T5-large model [23] is chosen to embed the text.
N refers to the number of many-shot learning examples, as shown in the task
prompt box in Figure 1. In Section 3, the optimal value of N for each task and other
aspects of the evaluation setup are discussed.
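As a sketch of this dynamic example retrieval, assuming the sentence-transformers library, placeholder training data, and cosine similarity as the retrieval criterion, the N most similar shots for a test sample could be obtained as follows.

from sentence_transformers import SentenceTransformer, util

# Placeholder training data; in the real setup these are the task's training samples.
train_texts = ["discharge summary of patient A ...", "discharge summary of patient B ..."]
train_codes = ["MS-DRG 871", "MS-DRG 470"]

# sentence-T5-large for the long English discharge summaries of task 3.
model = SentenceTransformer("sentence-transformers/sentence-t5-large")
train_emb = model.encode(train_texts, convert_to_tensor=True, normalize_embeddings=True)

def retrieve_examples(test_text, n=20):
    """Return the N training examples most similar to the test sample."""
    query = model.encode(test_text, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(query, train_emb)[0]          # cosine similarity to every training shot
    top = scores.topk(min(n, len(train_texts))).indices.tolist()
    return [(train_texts[i], train_codes[i]) for i in top]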
3 Results
3.1 Technical Setup
To evaluate the performance of the prompt framework on these coding tasks, the
GPT-4o 2024-08-01 preview version of the Azure OpenAI service was used. The Azure
OpenAI service is one of the three online GPT-like services recommended for
responsible use of the MIMIC dataset [39].
Following related work, the macro F1 and accuracy metrics are used for performance
assessment in this study. A macro F1 score is the average of the F1 scores of each
unique code and therefore reflects effectiveness on the minority classes in tasks 2
and 3, where the data are extremely imbalanced. In contrast, a micro F1 score,
averaged over an entire test dataset, reports performance that is potentially skewed
towards the majority classes. In multiclass classification tasks, micro F1 is
equivalent to accuracy.
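The difference between the two metrics can be seen in a small worked example (a toy illustration with made-up labels, not data from this study) using scikit-learn's f1_score.

from sklearn.metrics import f1_score

# Toy 3-class example: class 0 dominates, class 2 is never predicted.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1]

print(f1_score(y_true, y_pred, average="macro"))  # ~0.452: dragged down by the missed minority class
print(f1_score(y_true, y_pred, average="micro"))  # ~0.667: equals accuracy (4 of 6 correct)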
As shown in Figure 6, the performance of both tasks increases significantly as the
number of shots increases. In task 1, the best macro F1 score reaches 0.7909 with
40 shots, and the best result in task 2 is 0.8274 with 20 shots. As the many-shot
size continues to grow, task 2 performance degrades quickly and task 1 performance
also drops gradually.
The STS and CTC test sets have 10,000 and 10,193 samples, respectively. The
performance of tasks 1 and 2 is evaluated on the CBLUE benchmark platform with the
complete test sets. In Table 1, the results of our method are compared with the results
of the leading methods reported in [35], where BERT networks with various
improvements were trained on Chinese corpora. Task 1 predicts whether a given question
pair is semantically similar or not; existing supervised model training approaches
outperform our prompt-based learning method on this task. Compared to task 1, task 2
is more complex, as a given sentence is classified into 1 of 44 categories. On this
task, our prompt-based learning method performs on par with existing supervised model
training approaches and even outperforms ALBERT-xxlarge.
Fig. 7 MIMIC-IV Sample Frequencies by MS-DRGs and Prediction Performance
the corresponding sample frequency. The bottom-performing code predictions cluster
in the below-1% frequency region.
4 Discussion
The section above reports the overall performance of the proposed prompt framework
applied to three different medical coding tasks, from the relatively simple semantic
coding task to the more complex clinical trial criteria and MS-DRG coding tasks. We
also investigated whether the smaller GPT-4o mini performs on par or better and found
that GPT-4o gives better results. For example, on the CTC dev set (500 samples),
GPT-4o mini scores 0.033 lower than GPT-4o in macro F1.
This section first presents ablation studies that validate the effectiveness of each
module in the proposed prompt framework, including the hypothesis that the more
relevant the learning examples are to a test sample, the more helpful they are to the
LLM in performing the task. Potential limitations are then discussed before concluding
the study.
4.1 Efficacy of Task Guidelines
To find out whether the classification guidelines learned by the meta prompts help
improve prediction performance, we remove the guidelines from the task prompts and
rerun the experiments, but on smaller evaluation sets, namely the ablation study sets.
Task 3 benefits greatly from the task guidelines. Without the meta-prompt-learned
classification guidelines, a drop of as much as 0.343 in macro F1 score is observed
over an ablation set of 100 samples from the task test set. The guidelines also help
the performance of tasks 2 and 1: without them, the macro F1 score drops by 0.056 and
0.030, respectively, over the ablation sets, namely the CTC dev set (500) and the STS
dev set (1000), as labeled in Figure 6. The results indicate that the more complex the
task, the more helpful the guidelines.
4.3 Limitations
In tasks 1 and 2, once the multi-shot size passes a certain point, as shown in Figure 6,
task performance is penalized as the multi-shot size continues to increase. In task 3,
because the input texts are much longer, GPT-4o supports only up to 80 shots in the
task prompt used. An ablation study with a multi-shot size of 30, however, results in
a 0.035 decrease in macro F1 score over its ablation set. This suggests that DRG code
prediction performance could benefit from an LLM that supports a larger context window.
For projects that consume tokens at large scale, the cost and high latency of online
LLM services can be a limitation.
4.4 Conclusions
This study proposes a language-agnostic prompt framework for predicting medical codes
with LLMs. The framework exploits the latest techniques developed in the prompt-based
learning field, including meta prompt, multi-shot, and dynamic in-context learning, to
improve performance on complex tasks. The framework implementation combines the
commercial Azure OpenAI GPT-4o service with small open-source LLMs. Its effectiveness
is then evaluated on different tasks in the context of institutional and standard
coding schemas. Ablation studies show that the key proposals, extracting task-specific
knowledge into classification guidelines and multi-shot dynamic in-context learning,
are effective and drastically lift the performance of complex tasks. Compared to
related works that take a supervised model training approach, our prompt-based
learning framework gives comparable performance on two complex coding tasks but
underperforms on the relatively simple semantic coding task. With the rapid
advancement of LLMs, their context windows will continue to grow, and the proposed
prompt framework has the potential to further enhance DRG code prediction performance.
Acknowledgements
This research is supported by Amplify Health Asia.
Funding Declaration
There was no funding.
Declarations
The author has no relevant financial or non-financial interests to disclose.
References
[1] OpenAI: ChatGPT-Release Notes. 30 Nov 2022. https://help.openai.com/en/
articles/6825453-chatgpt-release-notes. Accessed 15 Dec 2024
[2] Google: Introducing Gemini: our largest and most capable AI model. 06 Dec 2023.
https://blog.google/technology/ai/google-gemini-ai/. Accessed 15 Dec 2024
[4] Kyung, J.H., Kim, E.: Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist
Licensing Examination: comparison study. JMIR Medical Education 10:e57451
(2024) https://doi.org/10.2196/57451
[5] Bicknell, B.T., et al.: ChatGPT-4 Omni performance in USMLE disciplines and
clinical skills: comparative analysis. JMIR Medical Education 10:e63430 (2024)
https://doi.org/10.2196/63430
[6] Xiaoliang Luo, A.R., et al.: Large language models surpass human experts in
predicting neuroscience results. Nat Hum Behav (2024) https://doi.org/10.1038/
s41562-024-02046-9
[7] Leyao Wang, Z.W., et al.: Conversational large language models in health care:
Systematic review. J Med Internet Res 26:e22769 (2024) https://doi.org/10.
2196/22769
[8] Shepheard, J.: Clinical coding and the quality and integrity of health data. Health
Inf Manag 49:3-4 (2020) https://doi.org/10.1177/1833358319874008
[9] Drabiak, K., Wolfson, J.: What should health care organizations do to reduce
billing fraud and abuse? AMA J Ethics 22(3):221-231 (2020) https://doi.org/
10.1001/amajethics.2020.221
[10] Babre, D.: Medical coding in clinical trials. Perspect Clin Res 1(1):29-32 (2010)
[11] Shaoxiong Ji, X.L., et al.: A unified review of deep learning for automated medical
coding. ACM Computing Surveys 55(12):1-41 (2024) https://doi.org/10.1145/
3664615
[12] Alistair E.W. Johnson, T.J.P., et al.: MIMIC-III, a freely accessible critical care
database. Sci Data 3:160035 (2016) https://doi.org/10.1038/sdata.2016.35
[13] Shuyuan Hu, F.T., et al.: An explainable CNN approach for medical codes
prediction from clinical text. BMC Med Inform Decis Mak 21:256 (2021)
https://doi.org/10.1186/s12911-021-01615-6
[14] Xingwang Li, Y.Z., et al.: JLAN: medical code prediction via joint learning atten-
tion networks and denoising mechanism. BMC Bioinformatics 21(1):590 (2021)
https://doi.org/10.1186/s12859-021-04520-x
[15] Jinhyuk Lee, W.Y., et al.: BioBERT: a pre-trained biomedical language representation
model for biomedical text mining. Bioinformatics 36(4):1234-1240 (2020)
https://doi.org/10.1093/bioinformatics/btz682
[16] Chao-Wei Huang, S.-C.T., et al.: PLM-ICD: Automatic ICD coding with pre-trained
language models. In: Proceedings of the 4th Clinical Natural Language
Processing Workshop, pp. 10–20. Association for Computational Linguistics,
Seattle, WA (2022). https://doi.org/10.18653/v1/2022.clinicalnlp-1.2
[17] Jinghui Liu, D.C., et al.: Early prediction of diagnostic-related groups and esti-
mation of hospital cost by processing clinical notes. npj Digit. Med 4:103 (2021)
https://doi.org/10.1038/s41746-021-00474-9
[18] Daniel Hajialigol, D.K., et al.: DRGCODER: Explainable clinical coding for the
early prediction of diagnostic-related groups. In: Proceedings of the 2023 Confer-
ence on Empirical Methods in Natural Language Processing: System Demonstra-
tions, pp. 373–380. Association for Computational Linguistics, Singapore (2023).
https://doi.org/10.18653/v1/2023.emnlp-demo.34
[19] Emily Alsentzer, J.M., et al.: Publicly available clinical BERT embeddings. In:
Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78.
Association for Computational Linguistics, Minneapolis, Minnesota, USA (2019).
https://doi.org/10.18653/v1/W19-1909
[21] Yinhan Liu, M.O., et al.: RoBERTa: A robustly optimized BERT pretraining
approach. Preprint at https://arxiv.org/abs/1907.11692 (2019)
[22] Feng, Y.: Semantic textual similarity analysis of clinical text in the era of LLM.
In: 2024 IEEE Conference on Artificial Intelligence (CAI), pp. 1284–1289 (2024).
https://doi.org/10.1109/CAI59869.2024.00227
[23] Jianmo Ni, G.H.A., et al.: Sentence-T5: Scalable sentence encoders from pre-
trained text-to-text models. In: Findings of the Association for Computational
Linguistics: ACL 2022, pp. 1864–1874. Association for Computational Linguistics,
Dublin, Ireland (2022). https://doi.org/10.18653/v1/2022.findings-acl.146
[24] Nascimento Teixeira, B., Leitão, et al.: Can ChatGPT support clinical coding
using the ICD-10-CM/PCS? Informatics 11(4):84 (2024) https://doi.org/10.3390/
informatics11040084
[26] Matúš Falis, A.P.G., et al.: Can GPT-3.5 generate and code discharge summaries?
Journal of the American Medical Informatics Association 31(10):2284–2293
(2024) https://doi.org/10.1093/jamia/ocae132
[27] Alistair E.W. Johnson, L.B., et al.: MIMIC-IV, a freely accessible electronic health
record dataset. Sci Data 10 (2023) https://doi.org/10.1038/s41597-022-01899-x
[28] Hanyin Wang, C.G., et al.: DRG-LLaMA: tuning LLaMA model to predict
diagnosis-related group for hospitalized patients. NPJ Digit Med 7:16 (2024)
https://doi.org/10.1038/s41746-023-00989-3
[29] Pengfei Liu, W.Y., et al.: A systematic survey of prompting methods in natural
language processing. ACM Computing Surveys 55(9):1–15 (2023) https://doi.
org/10.1145/3560815
[30] Rishabh Agarwal, A.S., et al.: Many-shot In-Context Learning. Preprint at
https://arxiv.org/abs/2404.11018 (2024)
[31] Gao Tianyu, A. Fisch, et al.: Making pre-trained language models better few-
shot learners. In: Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Nat-
ural Language Processing (Volume 1: Long Papers), pp. 3816–3830. Association
for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.
acl-long.295
[33] Hou Yutai, D.H., et al.: MetaPrompting: Learning to learn better prompts. In:
Proceedings of the 29th International Conference on Computational Linguistics,
pp. 3251–3262 (2022). https://doi.org/10.48550/arXiv.2209.11486
[35] Ningyu Zhang, M.C., et al.: CBLUE: A Chinese biomedical language understanding
evaluation benchmark. In: Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
7888–7915. Association for Computational Linguistics, Dublin, Ireland (2022).
https://doi.org/10.18653/v1/2022.acl-long.544
[36] Centers for Medicare & Medicaid Services: ICD-10-CM/PCS MS-DRG V34.0 Definitions
Manual. CMS https://www.cms.gov/ICD10M/version34-fullcode-cms/fullcode_cms/
P0001.html. Accessed 15 Dec 2024
[37] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence Embeddings Using Siamese
BERT-Networks. Preprint at https://arxiv.org/abs/1908.10084 (2019)
[39] PhysioNet: Responsible Use of MIMIC Data With Online Services Like GPT.
https://physionet.org/news/post/gpt-responsible-use. Accessed 15 Dec 2024
(2023)