LLMs-in-the-Loop Part 2: Expert Small AI Models for De-identification
Abstract
The rise of chronic diseases and pandemics like COVID-19 has emphasized the need for
effective patient data processing while ensuring privacy through anonymization and de-
identification of protected health information (PHI). Anonymized data facilitates research
without compromising patient confidentiality.
This paper introduces expert small AI models developed using the LLM-in-the-loop method-
ology to meet the demand for domain-specific de-identification NER models. These models
overcome the privacy risks associated with large language models (LLMs) used via APIs by
eliminating the need to transmit or store sensitive data. More importantly, they consistently
outperform LLMs in de-identification tasks, offering superior performance and reliability.
Our de-identification NER models, developed in eight languages — English, German, Italian, French, Romanian, Turkish, Spanish, and Arabic — achieved average F1-micro scores
of 0.966, 0.975, 0.976, 0.970, 0.964, 0.974, 0.978, and 0.953, respectively. These results es-
tablish them as the most accurate healthcare anonymization solutions available, surpassing
existing small models and even general-purpose LLMs such as GPT-4o.
While Part-1 of this series introduced the LLM-in-the-loop methodology for bio-medical
document translation, this second paper showcases its success in developing cost-effective
expert small NER models in de-identification tasks. Our findings lay the groundwork for fu-
ture healthcare AI innovations, including biomedical entity and relation extraction, demon-
strating the value of specialized models for domain-specific challenges.
1 Introduction
Patient data is essential for improving public health, expanding preventive health services, preventing diseases, and formulating necessary health policies. Recent studies show that almost all (99%) hospitals in the United States use electronic health records (EHRs) [1]. Similarly, Wales, Scotland, Denmark, and Sweden have adopted EHRs in the last few years, although nationally accessible health data is still lacking in the UK. The COVID-19 pandemic, in particular, has again highlighted the importance of EHR data [2]. Thanks to EHRs, disease trends can be examined, models can be built, and health policies can be developed.
As technology has grown more complex and evolved alongside medical practice, methods that protect patient privacy have become necessary [3]. With information security and information leakage gaining importance, breaches of patient data can have significant consequences that go beyond ethical violations and extend into fundamental and health law [4].
Personal data is sensitive information that can be associated with an individual and is protected by various laws [5]. In healthcare, such privacy-sensitive data is called PHI and includes private information such as a patient’s health history and the treatments they have received [6].
EHRs contain both valuable clinical information and PHI. While EHRs are a rich data source
for research, their usability is restricted due to the confidentiality of PHI [7-10]. For example,
the HIPAA law regulates the use of 18 types of PHI, such as names, phone numbers, and dates (Table 1) [11, 12]. Therefore, PHI must be extracted from the text before EHR data can be used. Automated de-identification systems are needed because manually extracting PHI is time-consuming and costly; in addition, consistency between annotators is an important consideration [13, 14]. While early approaches to de-identification relied on complex rules to detect PHI, recent developments use machine learning methods trained on expert-annotated records. Hybrid systems integrate rule-based outputs as features in statistical models such as conditional random fields (CRFs) [15].
Table 1: Protected Health Information Types [11]
No PHI Type
1 Names
2 All geographic subdivisions smaller than a state
3 Dates
4 Telephone numbers
5 Vehicle identifiers
6 Fax numbers
7 Device identifiers and serial numbers
8 Emails
9 URLs
10 Social security numbers
11 Medical record numbers
12 IP addresses
13 Biometric identifiers
14 Health plan beneficiary numbers
15 Full-face photographic images and any comparable images
16 Account numbers
17 Certificate/license numbers
18 Any other unique identifying number, characteristic, or code
De-identification makes it possible to use EHRs in research by removing confidential information [15]. Basic de-identification rules include removing direct identifiers such as names and dates. Advanced statistical methods anonymize the data, thus reducing the risk of re-identification [16]. However, new techniques may also introduce unknown privacy risks, so continuous evaluation and improvement efforts are necessary [17]. Advanced methods can enable extensive collections of EHRs to be used efficiently and securely in research.
According to HIPAA, there are two possible methods of identity masking. The “Expert Determination” method requires employing a domain expert who, using various statistical methods, verifies that there is only a small risk of identifying the individuals whose information is used; the expert must have sufficient experience and knowledge. The other method, “Safe Harbor”, involves de-identifying 18 pre-determined identifiers that must be removed and/or modified in the corpus [11, 18]. Studies using deep learning (DL) models follow the Safe Harbor method and de-identify the relevant PHI.
The lack of comprehensive data privacy frameworks can lead to vulnerabilities, leaving sensitive
patient information susceptible to breaches and misuse. Despite efforts to anonymize this data,
reidentification is still feasible through just a few spatiotemporal data points [19]. Recent ad-
vancements in privacy-preserving technologies have seen increased adoption [20], particularly
in artificial intelligence (AI) and big data analytics. These technologies are vital in addressing
major global health challenges by enhancing access to healthcare, promoting health, preventing
diseases, and improving the overall experience for healthcare professionals and patients. AI,
coupled with big data analytics, is the backbone for many innovations in digital health, driv-
ing improvements in care delivery and decision-making processes. Together, these domains are
supported by additional technologies like the Internet of Things (IoT), next-generation networks
(e.g., 5G), and privacy-preserving platforms such as blockchain [21, 22].
However, questions remain regarding accountability for AI and LLM outcomes. Since AI lacks
autonomy and sentience, it cannot hold moral responsibility, leaving uncertainty about who
should be accountable for its decisions and actions [23].
From our perspective, the “LLM-in-the-loop” methodology makes LLMs an integral part of the development process for expert small models, without relying on LLMs as the final solution.
Instead of directly using LLMs for tasks, we utilize them selectively at various stages, such
as synthetic data generation, rigorous evaluation, and agent orchestration, to improve the per-
formance of smaller, domain-specific models. This approach allows us to benefit from the
capabilities of LLMs while keeping the models efficient, focused, and specialized for specific
tasks.
Recently, there has been growing interest in work carried out within the scope of LLM-in-the-loop approaches. Studies have shown that LLMs perform well on tasks traditionally completed by humans [24-26], underscoring their potential for effective utilization. The “LLM-in-the-loop” approach is now being applied in a variety of fields. In one study conducted to analyze social media content and reveal hidden themes [27], the advanced capabilities of LLMs were leveraged to gain a deeper understanding of social media messages, discover thematic structures and nuances in texts, and effectively match texts to themes. Another study applied the LLM-in-the-loop technique to improve the performance of LLMs themselves, aiming to continuously improve model outputs through iterative feedback loops in the medical domain. The goal was to increase the accuracy and reliability of the model and reduce hallucinations. In that study, human experts evaluated the model outputs, provided feedback, and this feedback was used to retrain the model, with a focus on reducing model errors and obtaining more reliable results in medical question-answering and summarization tasks [28].
Another study, which examines the potential of LLMs to recognize and examine intertextual
relationships in biblical and Koine Greek texts, highlights how LLMs evaluate different in-
tertextual scenarios and how these models can detect direct quotations, allusions, and echoes
between texts. The study also mentions the ability of LLMs to generate intertextual observa-
tions and connections and the potential of these models to reveal new insights. However, it is
noted that the model has difficulties with long query texts and can create incorrect intertextual
connections, which reveals the importance of expert evaluation [29].
We first used the “LLMs-in-the-loop” method in the context of bio-medical document translation
[30]. In this work, we demonstrate its success in developing cost-effective expert small NER
models for de-identification tasks. Our findings lay the groundwork for future healthcare AI
innovations, including biomedical entity and relation extraction, and demonstrate the value of
specialized models for domain-specific challenges.
2 Background
De-identification models, typically framed as Named Entity Recognition (NER) classification models, can be considered under four headings [31]:
• Rule-based models
• Machine learning models
• Hybrid models
• Deep learning models
Techniques such as rule-based models and dictionaries can be easily implemented without la-
bels but are vulnerable to input errors [31-34]. ML methods such as Support Vector Machines
(SVM) and conditional random fields (CRF) can recognize complex patterns but require large
amounts of labelled data and feature engineering and are poor at generalization [35-37]. Hybrid
systems combine rule-based and ML models, providing high accuracy but requiring intensive
feature engineering [38, 39].
Considering the disadvantages of the first three approaches to de-identification system creation,
the latest state-of-the-art systems employ DL techniques to achieve better results than the best
hybrid systems without requiring a time-consuming feature engineering process. DL is an ML
subset using multilayered Artificial Neural Networks (ANN) and is very successful in most
Natural Language Processing (NLP) tasks. Recent advances in DL and NLP (especially in the
field of NER) enable the systems to outperform the winning hybrid system proposed by Yang
and Garibaldi [39] on the 2014 i2b2 de-identification challenge dataset [31, 35].
De-identifying unstructured data is a widely recognized problem [40] in NLP, involving two key
tasks: identifying PHI and replacing it through masking or obfuscation. Research has primarily
focused on PHI identification. Early de-identification approaches [41] and [42], especially in
healthcare, were rule-based, using regular expressions, syntactic rules, and specialized dictio-
naries to detect PHI, such as phone numbers and emails. However, they struggled with identi-
fying more complex entities like names and professions and required significant adjustments to
function in different datasets, limiting their flexibility. The 2014 i2b2 project [34] introduced
automatic de-identification, fueling the advancement of machine and deep learning models
for more accurate PHI detection. Early machine learning methods, such as Conditional Random
Fields (CRF) [43], used hand-crafted features and lexical rules [44], signaling a shift to more
adaptive and scalable approaches.
Work in the de-identification context has achieved human-level accuracy in de-identifying clin-
ical notes from research datasets, but challenges remain in scaling this success to large, real-
world environments. A hybrid context-based model outperformed traditional NER models by 10% on the i2b2-2014 benchmark and made significantly fewer errors (93% accuracy) than ChatGPT (60% accuracy) [45].
LLM-based methods have also been used to develop de-identification models. However, these are still at an early stage, and further development is needed to protect the privacy and security of health data [46]. The continued reliance on APIs for LLMs and the problem of storing patient data show that expert models are still needed.
3 Methodology
This section details the purpose of the research, the datasets employed, the methods for training
and testing, the data preparation process, and the modelling and evaluation phases. Key to this
study is the protection of personal data, adherence to legal regulations, and addressing the risks
associated with processing sensitive patient information.
Our LLM-in-the-loop methodology leverages LLMs at key stages such as synthetic data genera-
tion, labelling, and evaluation, focusing on the development of high-performance, expert small
models. To this end, we used a combination of proprietary closed-source data, open-source
datasets, and synthetic data, all annotated by our labelling team in accordance with i2b2 la-
belling logic. The incorporation of synthetic data and LLM-assisted labelling further enhanced
the scope and quality of our training datasets.
For English-language de-identification NER models, we utilized the entire dataset for training.
The i2b2 test dataset served as the exclusive test set for evaluation purposes, allowing us to
benchmark performance with high precision. For non-English languages, we applied an 80-
20 split for training and testing. Additionally, our medical translation models [30] were used
to translate the English datasets into non-English languages, generating high-quality parallel
datasets across multiple languages.
In the data pre-processing phase, we employed language-specific tools to ensure accurate de-identification across different languages. The Stanza library was used for Romanian-language tasks, while the NLTK library was used for the other languages. Word tokenization for all datasets was performed using the wordpunct tokenizer from the NLTK library.
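A minimal sketch of this pre-processing step is shown below; the helper function names are our own, and it assumes the required NLTK resources and the Stanza Romanian model have been downloaded.

# Minimal pre-processing sketch (illustrative; helper names are ours).
# Assumes: pip install nltk stanza, and that stanza.download("ro") has been run.
import nltk
import stanza
from nltk.tokenize import wordpunct_tokenize

nltk.download("punkt", quiet=True)  # newer NLTK versions may also need "punkt_tab"
ro_pipeline = stanza.Pipeline(lang="ro", processors="tokenize", verbose=False)

def split_sentences(text: str, lang: str) -> list[str]:
    """Sentence-split with Stanza for Romanian and NLTK for the other languages."""
    if lang == "ro":
        return [sentence.text for sentence in ro_pipeline(text).sentences]
    return nltk.sent_tokenize(text)

def tokenize_words(sentence: str) -> list[str]:
    """Word tokenization with NLTK's wordpunct tokenizer, as used for all datasets."""
    return wordpunct_tokenize(sentence)

for sent in split_sentences("Mrs. Linda Martinez was seen on 2023-05-10. She is 45.", lang="en"):
    print(tokenize_words(sent))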
For evaluation, we adopted the strict evaluation method, where both the chunk and the label had
to match to be considered a correct prediction. This rigorous approach ensures the accuracy and
reliability of our models, particularly in handling PHI.
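As an illustration of this strict criterion, the sketch below uses the seqeval library in strict IOB2 mode (our choice of tooling for the example; the paper's exact evaluation scripts are not shown): an entity counts as correct only when both its span and its label match exactly.

# Strict entity-level evaluation sketch with seqeval (illustrative tooling choice).
# An entity is counted as correct only when chunk boundaries and label both match.
from seqeval.metrics import classification_report, f1_score
from seqeval.scheme import IOB2

y_true = [["B-PATIENT", "I-PATIENT", "O", "B-DATE", "O"]]
y_pred = [["B-PATIENT", "I-PATIENT", "O", "B-DATE", "B-AGE"]]  # one spurious AGE entity

print(classification_report(y_true, y_pred, mode="strict", scheme=IOB2))
print("micro F1:", f1_score(y_true, y_pred, mode="strict", scheme=IOB2, average="micro"))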
The results in Table 4 and Table 7 were achieved using a structured and detailed prompt de-
signed to extract Protected Health Information (PHI) from clinical notes. The prompt provided
a comprehensive list of entity definitions, such as AGE, CITY, DEVICE, and ORGANIZA-
TION, along with examples for clarity. It instructed GPT-4o to identify and mark entities using
a consistent tagging format (e.g., BEGINER_LABEL CHUNK ENDNER) while preserving the
original text. Specific guidelines were included for nuanced cases, such as excluding titles (e.g.,
”Dr.”) from names and marking only actual dates for the DATE label. This rigorous approach
ensured precision in high-performing categories and highlighted areas for improvement in more
challenging entities. The prompt used in the study is presented in Appendix A.
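To score GPT-4o's output, the marked text has to be converted back into (label, chunk) pairs. The small parser below is an illustrative sketch of that step; the regular expression and the function name are our own and assume the tagging format shown in Appendix A.

# Illustrative parser for the BEGINER_LABEL ... ENDNER markup (names and regex are ours).
import re

TAG_PATTERN = re.compile(r"BEGINER_([A-Z\-]+)\s+(.*?)\s+ENDNER", re.DOTALL)

def extract_entities(marked_text: str) -> list[tuple[str, str]]:
    """Return (label, chunk) pairs found in the marked output."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(marked_text)]

marked = ("Mrs. BEGINER_PATIENT Linda Martinez ENDNER was seen on "
          "BEGINER_DATE 2023-05-10 ENDNER.")
print(extract_entities(marked))  # [('PATIENT', 'Linda Martinez'), ('DATE', '2023-05-10')]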
3.1 Datasets
“i2b2-2014” is a research project on de-identification and heart disease in clinical texts (https://portal.dbmi.hms.harvard.edu), and
its labelling logic was used in our study. For English-language de-identification NER mod-
els, we utilized a combination of mostly open-source and synthetic data, with 22% derived
from proprietary closed-source data. The i2b2 test dataset served as the exclusive test set for
evaluation, enabling us to benchmark performance with high precision. For non-English lan-
guages, we applied an 80-20 split for training and testing. Most of the non-English datasets
were generated through translation from the English dataset using our medical translation mod-
els [30], open-source and through synthetic data generation with LLM-assisted labelling, pro-
ducing high-quality parallel datasets across multiple languages.
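For the non-English corpora, the 80-20 split can be reproduced with a standard utility such as scikit-learn's train_test_split; the snippet below is only an illustration of the split, and the toy sentences and tag sequences are placeholders.

# Illustrative 80-20 train/test split (scikit-learn is our tooling choice for the example).
from sklearn.model_selection import train_test_split

sentences = ["Satz 1 ...", "Satz 2 ...", "Satz 3 ...", "Satz 4 ...", "Satz 5 ..."]
tag_sequences = [["O"], ["O"], ["O"], ["O"], ["O"]]  # BIO tag sequences, truncated for brevity

train_sents, test_sents, train_tags, test_tags = train_test_split(
    sentences, tag_sequences, test_size=0.20, random_state=42)
print(len(train_sents), "train /", len(test_sents), "test")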
Additionally, we utilized some NLP techniques and open-source third-party tools (LangTest by John Snow Labs: https://langtest.org/) to enhance and augment the training datasets.
Although the i2b2 2014 dataset was not utilized for training purposes, we provide relevant in-
formation and statistics here to offer a more comprehensive understanding of its role in our
evaluation process. i2b2/UTHealth is a dataset focused on identifying medical risk factors for
Coronary Artery Disease (CAD) in the medical records of diabetic patients, where risk factors
include hypertension, hyperlipidemia, obesity, smoking status, and family history, as well as
diabetes, CAD, and indicators suggestive of the presence of these diseases [47]. The i2b2 dataset consists of 1,304 progress notes of 296 diabetic patients. All PHI in the dataset was removed during the original study and replaced with random surrogates. The PHI in this dataset was first categorized into HIPAA categories and then into i2b2-PHI categories, as shown in Table 2. Overall, the i2b2 dataset contains 56,348 sentences with 984,723 individual tokens, of which 41,355 are individual PHI tokens representing 28,867 particular PHI instances [31].
A review of the literature shows that datasets for de-identification studies in languages other than English are relatively limited. As a result, only a few de-identification models have been developed for other languages. The de-identification models developed in this study for different languages will therefore contribute to the literature, to data scientists working on these models, and to the health institutions that will use them.
In the study, ten labels were used for the rule-based method and 18 labels for the deep learning method. The training dataset was augmented for ORGANIZATION, PROFESSION, and LOCATION-OTHER, since these entities gave low results in the first training run with the deep learning method. The augmentation stages were performed as follows (a minimal sketch is provided after the list):
• Firstly, a fake chunk data frame was created for each label in various formats.
• Each labelled chunk was removed and replaced with label abbreviations.
• The sentences were translated from English to the working language using our medical translation models [30].
• The label abbreviations in the new sentences were replaced with new chunks of those
labels from the fake data frame.
• This new dataset was converted to “beginning, inside, outside” (BIO) format and added
to the training dataset.
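The minimal sketch below illustrates these augmentation stages; the fake chunk values, the translate() placeholder standing in for our medical translation models [30], and all function names are our own simplifications rather than the production pipeline.

# Illustrative sketch of the augmentation stages above (names and values are ours).
# translate() stands in for the medical translation models [30]; it is not defined here.
import random

fake_chunks = {  # step 1: fake chunk "data frame" per label, in various formats
    "ORGANIZATION": ["Mercy General Hospital", "Nordstadt Klinik"],
    "PROFESSION": ["architect", "teacher"],
    "LOCATION-OTHER": ["Riverside Park", "Central Station"],
}

def mask_chunks(sentence: str, annotations: list[tuple[str, str]]) -> str:
    """Step 2: remove each labelled chunk and replace it with its label abbreviation."""
    for label, chunk in annotations:
        sentence = sentence.replace(chunk, f"<{label}>")
    return sentence

def fill_chunks(sentence: str) -> tuple[str, list[tuple[str, str]]]:
    """Step 4: replace label abbreviations with new fake chunks for those labels."""
    new_annotations = []
    for label, candidates in fake_chunks.items():
        placeholder = f"<{label}>"
        while placeholder in sentence:
            chunk = random.choice(candidates)
            sentence = sentence.replace(placeholder, chunk, 1)
            new_annotations.append((label, chunk))
    return sentence, new_annotations

masked = mask_chunks("She works as an engineer at City Hospital.",
                     [("PROFESSION", "engineer"), ("ORGANIZATION", "City Hospital")])
# translated = translate(masked, target_lang="de")   # step 3 (placeholder)
augmented, annotations = fill_chunks(masked)         # step 4
print(augmented, annotations)                        # step 5: convert to BIO, add to training data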
The performance of the model implemented with the DL method was tested on the i2b2-2014 test set. The model retrained on the dataset with augmented labels showed better classification results when evaluated on that test set [33].
For the English de-identification model trained with the DL method, training used a learning rate of 2e-5, a maximum sentence length of 512, a batch size of 2, and ten epochs. For the rule-based method, regexes suitable for each format were created for the selected labels (simplified examples are sketched below).
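The patterns below are simplified examples of the kind of format-specific regexes used for selected labels; they are illustrative only, not the exact expressions from the study.

# Simplified examples of format-specific regexes for selected labels (illustrative only).
import re

RULES = {
    "PHONE": re.compile(r"\b(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def rule_based_phi(text: str) -> list[tuple[str, str]]:
    """Return (label, match) pairs detected by the regex rules."""
    hits = []
    for label, pattern in RULES.items():
        hits.extend((label, m.group(0)) for m in pattern.finditer(text))
    return hits

print(rule_based_phi("Call 555-123-4567 or email [email protected] before 20/10/2023."))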
The training process was carried out with the resulting dataset; the 20% portion held out during the split was used as the test set. The dataset was preprocessed and converted into BIO format (a conversion sketch follows).
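The sketch below shows one minimal way to convert (label, chunk) annotations into BIO-tagged tokens using the same wordpunct tokenization as above; it assumes chunks can be located by exact token matching and is meant only as an illustration.

# Minimal BIO conversion sketch (illustrative; assumes chunks can be located by exact token match).
from nltk.tokenize import wordpunct_tokenize

def to_bio(sentence: str, annotations: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Tokenize a sentence and assign B-/I-/O tags from (label, chunk) annotations."""
    tokens = wordpunct_tokenize(sentence)
    tags = ["O"] * len(tokens)
    for label, chunk in annotations:
        chunk_tokens = wordpunct_tokenize(chunk)
        for i in range(len(tokens) - len(chunk_tokens) + 1):
            if tokens[i:i + len(chunk_tokens)] == chunk_tokens:
                tags[i] = f"B-{label}"
                for j in range(1, len(chunk_tokens)):
                    tags[i + j] = f"I-{label}"
                break
    return list(zip(tokens, tags))

print(to_bio("Mrs. Linda Martinez was discharged on 20/10/2023.",
             [("PATIENT", "Linda Martinez"), ("DATE", "20/10/2023")]))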
The augmentation stages for the other language models followed the same steps: a fake chunk data frame was created for each label in various formats from the English de-identification dataset, each labelled chunk was replaced with its label abbreviation, the sentences were translated from English to the working language with our medical translation models [30], the abbreviations were replaced with new chunks of those labels from the fake data frame, and the resulting dataset was converted to BIO format and added to the training set.
For the seven languages other than English, de-identification models were trained with the DL method using a learning rate of 2e-5, a maximum sentence length of 512, a batch size of 16 (2 for Romanian), and ten epochs (a fine-tuning sketch with these hyperparameters follows).
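A hedged sketch of this fine-tuning setup with the reported hyperparameters is shown below using the Hugging Face Trainer; the base checkpoint (bert-base-multilingual-cased), the label list, and the dataset objects are placeholders of our own, since the paper does not name its base model.

# Fine-tuning sketch with the reported hyperparameters (lr=2e-5, max length 512,
# batch size 16, 10 epochs). Checkpoint, labels, and datasets are placeholders.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

label_list = ["O", "B-PATIENT", "I-PATIENT", "B-DATE", "I-DATE"]  # truncated for brevity
checkpoint = "bert-base-multilingual-cased"  # assumed; the paper does not name its base model

tokenizer = AutoTokenizer.from_pretrained(checkpoint, model_max_length=512)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(label_list))

args = TrainingArguments(
    output_dir="deid-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=16,   # 2 for Romanian and for the English model
    num_train_epochs=10,
)

# train_dataset / eval_dataset are assumed to be tokenized BIO datasets prepared as above.
# trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()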
4 Results
The results obtained from the de-identification NER models are shown in Table 4. The same table also includes the results obtained with GPT-4o and, for comparison, the results of other studies that used the same dataset.
As seen in Table 4, the model developed in this study covers PHI types not used in other studies, and satisfactory results were obtained. When the performance of prior studies is compared with the results of this study, new SOTA values are achieved. Although training was performed with 18 PHI labels (the DEVICE and LOCATION-OTHER labels were not used in other studies) and some labels did not reach high scores, the F1 macro score obtained in this study (0.931) was higher than that of the other models, establishing a new SOTA value.
GPT-4o performs well in classes such as CITY, COUNTRY, ZIP, and STATE, achieving high
precision, recall, and F1-scores. However, it struggles significantly with IDNUM, LOCATION-
OTHER, ORGANIZATION, EMAIL, FAX, and DEVICE, where the scores are notably low.
The macro average (0.5757) indicates that the model’s performance varies significantly across
classes, with weaker performance in certain categories. On the other hand, the micro average
(0.5907) is slightly higher, reflecting the model’s stronger performance in more frequent classes,
but overall, the scores are low.
As a result of the de-identification study in seven different languages, the results obtained for
13 labels in German, Italian, and French are shown in Table 5, while the results obtained for
Turkish (13 labels), Spanish (14 labels), and Romanian (14 labels) are shown in Table 6.
Table 5 presents F1-scores for de-identification tasks across the German, Italian, and French
datasets. Overall, the German model achieves the highest macro-average (0.960), followed by
Italian (0.955) and French (0.937). DATE and PHONE categories exhibit consistently strong
performance across all languages, achieving nearly perfect scores (≥ 0.995). In contrast, the
ORGANIZATION category shows notable variability, with the French model scoring signifi-
cantly lower (0.699). These results highlight the robustness of the models in categories such
as AGE, IDNUM, and ZIP while identifying areas for improvement in language-specific chal-
lenges, particularly for underperforming categories like ORGANIZATION in French (Table 5).
However, since no published benchmarks could be found for these languages, the scores obtained in this study could not be directly compared.
Table 6: Turkish, Spanish, Romanian, and Arabic de-identification Model Outputs
(F1-Score)
Table 6 highlights strong performances for Turkish (macro-avg 0.963) and Spanish (0.957)
models, followed by Romanian (0.930) and Arabic (0.922). Categories like DATE, PHONE,
and MEDICAL RECORD achieve near-perfect scores across languages, demonstrating model
robustness. Lower scores are observed for CITY and ORGANIZATION in Romanian and Ara-
bic, indicating room for improvement. Missing or language-specific labels (e.g., EMAIL, SSN)
show variability in evaluation, reflecting dataset differences. Turkish and Spanish excel in most
categories, with consistent performance across diverse labels.
Table 7: i2b2 Test Set Scores (IOB Token Level) Using GPT-4o
Table 7, which evaluates the B- (Beginning) and I- (Inside) tags separately, shows that the model achieves high overall accuracy (0.9672). Classes like B-STATE, I-CITY, and I-COUNTRY perform very well, while B-EMAIL, B-FAX, and I-LOCATION have lower precision and recall, indicating challenges in identifying these entities. The macro average (0.5775) is lower than the weighted average (0.968), suggesting that less frequent or more difficult classes pull down the macro scores, whereas the model is quite successful at predicting the more common entities.
The low scores are attributed to several issues. The model struggles to identify patient and doctor names located in the middle of the text, even though it can find those at the beginning and end. Some hospital names are only partially labelled, which affects the overall precision and recall. Occasionally, the model includes extra tokens within the labels, leading to incorrect annotations. Despite the prompt specifying which labels to use, the model sometimes incorrectly adds labels that were not meant to be included, such as time. The model also confuses some labels or fails to identify them altogether, contributing to the lower scores.
5 Conclusion
This study underscores the importance of de-identification as a key method for safeguarding pa-
tient/personal health information and ensuring its ethical use in scientific research. By remov-
ing identifiable details through techniques like anonymization, generalization, and differential
privacy, de-identification allows data to be used for diverse scientific applications, including
epidemiological studies, disease modelling, and artificial intelligence development while main-
taining patient privacy.
Recent advancements have demonstrated the potential of LLMs in de-identification tasks, yet
challenges remain, particularly around issues of patient data security, API dependencies, and
the need for domain-specific expertise in handling EHRs. Our ”LLMs-in-the-loop” approach
addresses these concerns by integrating small, specialized models tailored to the medical field.
This method enhances both privacy and reliability, enabling the secure use of data without rely-
ing on external APIs or compromising sensitive patient information.
The multilingual nature of this research, spanning several languages, shows the adaptability
and robustness of our models across diverse healthcare environments. While there are inherent
risks associated with data anonymization, this study demonstrates that when properly applied,
de-identification models can strike a delicate balance between protecting individual privacy and
maximizing the utility of health data.
Ultimately, the findings of this study highlight the potential of expert small models devel-
oped through the LLMs-in-the-loop methodology to meet the evolving demands of health-
care research. The models presented here offer a reliable and scalable solution for future de-
identification applications, advancing the capabilities of AI in healthcare while safeguarding
patient privacy.
Future research should focus on further refining and expanding de-identification models to
cover a wider range of languages and healthcare contexts. One of the primary challenges is the
scarcity of high-quality, annotated datasets in languages other than English, which limits the de-
velopment of robust models for non-English speaking regions. Addressing this gap will require
collaborative efforts to create and share multilingual datasets, ensuring more comprehensive
language coverage. Additionally, future studies could explore more advanced augmentation
techniques and develop models capable of handling increasingly complex medical data types,
such as clinical narratives and imaging reports. Continuous innovation in privacy-preserving
methods, such as federated learning, may also prove valuable in safeguarding sensitive patient
information while advancing the performance and applicability of de-identification technolo-
gies across diverse healthcare systems.
References
[1] Ahmed, T., M.M.A. Aziz, and N. Mohammed, De-identification of electronic health
record using neural network, Sci Rep, 2020, 10(1): p. 18600.
[2] Wood, A., et al., Linked electronic health records for research on a nationwide cohort of
more than 54 million people in England: data resource, BMJ, 2021, 373: p. n826.
[3] Gungoren, M., F. Orhan, and N. Kurutkan, Mikro Rekabetçilikte Yeni Yaklaşımlar:
Hastanelerde Oluşan Etik İklimin Kalite ve Akreditasyon Açısından Değerlendirilmesi,
Süleyman Demirel Üniversitesi İktisadi ve İdari Bilimler Fakültesi Dergisi, 2013, 18(1):
p. 221-241.
[4] Varol, Ş., et al., Sağlık kurumlarında bilgi güvenliği bağlamında biyometrik sistemler,
Sağlık Akademisyenleri Dergisi, 2016, 3(4): p. 155-162.
[5] Yilmaz, D., E. Erguner Ozkoc, and G. Ogutcu Ulas, Elektronik Sağlık Kayıtlarında
Farkındalık, 24, 2023.
[6] healthITSecurity, De-Identification of PHI According to the HIPAA Privacy Rule, 2023,
April 13, 2023; Available from: https://healthitsecurity.com/features/de-identification-of-
phi-according-to-the-hipaa-privacy-rule.
[7] Act, A., Health insurance portability and accountability act of 1996, Public law, 1996,
104: p. 191.
[8] Fernandez-Aleman, J.L., et al., Security and privacy in electronic health records: a sys-
tematic literature review, J Biomed Inform, 2013, 46(3): p. 541-62.
[9] Office for Civil Rights, H., Standards for privacy of individually identifiable health infor-
mation. Final rule, Federal register, 2002, 67(157): p. 53181-53273.
[10] Toscano, F., et al., Electronic health records implementation: can the European Union
learn from the United States?, European Journal of Public Health, 2018, 28(suppl 4): p.
cky213. 401.
[11] hhs.gov, Guidance on De-identification of Protected Health Information (hhs_deid_guidance.pdf), 2012; [cited 2023 July 17]; Available from: https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf.
[12] hhs.gov, Standards for Privacy of Individually Identifiable Health Info — HHS.gov,
2013; [cited 2023 July 17]; Available from: https://www.hhs.gov/hipaa/for-
professionals/privacy/guidance/standards-privacy-individually-identifiable-health-
information/index.html.
[13] Neamatullah, I., et al., Automated de-identification of free-text medical records, BMC Med
Inform Decis Mak, 2008, 8: p. 32.
[14] Paul, T., et al., Investigation of the Utility of Features in a Clinical De-identification
Model: A Demonstration Using EHR Pathology Reports for Advanced NSCLC Patients,
Front Digit Health, 2022, 4: p. 728922.
[16] Wu, H., et al., SemEHR: A general-purpose semantic search system to surface semantic
data from clinical notes for tailored care, trial recruitment, and clinical research, J Am
Med Inform Assoc, 2018, 25(5): p. 530-537.
[17] Stubbs, A. and O. Uzuner, Annotating risk factors for heart disease in clinical narratives
for diabetic patients, J Biomed Inform, 2015, 58 Suppl(Suppl): p. S78-S91.
[18] Catelli, R., et al., A Novel COVID-19 Data Set and an Effective Deep Learning Approach
for the De-Identification of Italian Medical Records, IEEE Access, 2021, 9: p. 19097-
19110.
[19] Reddy, S., et al., A governance model for the application of AI in health care, J Am Med
Inform Assoc, 2020, 27(3): p. 491-497.
[20] Ong, J.C.L., et al., Artificial intelligence, ChatGPT, and other large language models for
social determinants of health: Current state and future directions, Cell Rep Med, 2024,
5(1): p. 101356.
[21] Gunasekeran, D.V., et al., Digital health during COVID-19: lessons from operationalising
new models of care in ophthalmology, Lancet Digit Health, 2021, 3(2): p. e124-e134.
[22] Ting, D.S.W., et al., Digital technology and COVID-19, Nat Med, 2020, 26(4): p. 459-461.
[23] Verdicchio, M. and A. Perin, When Doctors and AI Interact: on Human Responsibility for
Artificial Risks, Philos Technol, 2022, 35(1): p. 11.
[24] Dai, S.-C., A. Xiong, and L.-W. Ku, LLM-in-the-loop: Leveraging large language model
for thematic analysis, arXiv preprint arXiv:2310.15100, 2023.
[25] De Paoli, S., Can Large Language Models emulate an inductive Thematic Analysis of
semi-structured interviews? An exploration and provocation on the limits of the approach
and the model, arXiv preprint arXiv:2305.13014, 2023.
[26] Gilardi, F., M. Alizadeh, and M. Kubli, ChatGPT outperforms crowd workers for text-
annotation tasks, Proc Natl Acad Sci U S A, 2023, 120(30): p. e2305016120.
[27] Islam, T. and D. Goldwasser, Discovering latent themes in social media messaging: A
machine-in-the-loop approach integrating llms, arXiv preprint arXiv:2403.10707, 2024.
[28] Pham, D.K. and B.Q. Vo, Towards Reliable Medical Question Answering: Tech-
niques and Challenges in Mitigating Hallucinations in Language Models, arXiv preprint
arXiv:2408.13808, 2024.
[29] Umphrey, R., J. Roberts, and L. Roberts, Investigating Expert-in-the-Loop LLM Discourse
Patterns for Ancient Intertextual Analysis, arXiv preprint arXiv:2409.01882, 2024.
[30] Keles, B., M. Gunay, and S.I. Caglar, LLMs-in-the-loop Part-1: Expert Small AI Models
for Bio-Medical Text Translation, arXiv preprint arXiv:2407.12126, 2024.
[31] Khin, K., P. Burckhardt, and R. Padman, A Deep Learning Architecture for De-
identification of Patient Notes: Implementation and Evaluation, arXiv pre-print server,
2018.
[32] Morrison, F.P., S. Sengupta, and G. Hripcsak, Using a pipeline to improve de-identification performance, AMIA Annu Symp Proc, 2009: p. 447-51.
[33] Stubbs, A., C. Kotfila, and O. Uzuner, Automated systems for the de-identification of lon-
gitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1, J
Biomed Inform, 2015, 58 Suppl(Suppl): p. S11-S19.
[34] Uzuner, O., Y. Luo, and P. Szolovits, Evaluating the state-of-the-art in automatic de-
identification, J Am Med Inform Assoc, 2007, 14(5): p. 550–63.
[35] Dernoncourt, F., et al., De-identification of patient notes with recurrent neural networks, J
Am Med Inform Assoc, 2017, 24(3): p. 596–606.
[36] Ferrandez, O., et al., Evaluating current automatic de-identification methods with Vet-
eran’s health administration clinical documents, BMC Med Res Methodol, 2012, 12: p.
109.
[37] Meystre, S.M., et al., Automatic de-identification of textual documents in the electronic
health record: a review of recent research, BMC Med Res Methodol, 2010, 10: p. 70.
[38] Liu, Z., et al., Automatic de-identification of electronic medical records using token-level
and character-level conditional random fields, J Biomed Inform, 2015, 58 Suppl(Suppl):
p. S47-S52.
[39] Yang, H. and J.M. Garibaldi, Automatic detection of protected health information from
clinic narratives, J Biomed Inform, 2015, 58 Suppl(Suppl): p. S30-S38.
[40] Nadkarni, P.M., L. Ohno-Machado, and W.W. Chapman, Natural language processing:
an introduction, J Am Med Inform Assoc, 2011, 18(5): p. 544-51.
[41] Sweeney, L., Replacing personally-identifying information in medical records, the Scrub
system, Proc AMIA Annu Fall Symp, 1996: p. 333-7.
[42] Gupta, D., M. Saul, and J. Gilbertson, Evaluation of a deidentification (De-Id) software
engine to share pathology reports and clinical documents for research, Am J Clin Pathol,
2004, 121(2): p. 176-86.
[43] He, B., et al., CRFs based de-identification of medical records, J Biomed Inform, 2015,
58 Suppl(Suppl): p. S39-S46.
[44] Lafferty, J., A. McCallum, and F. Pereira, Conditional random fields: Probabilistic models
for segmenting and labeling sequence data, in Icml. 2001. Williamstown, MA.
[45] Kocaman, V., D. Talby, and H.U. Hak, Beyond Accuracy: Automated De-Identification of
Large Real-World Clinical Text Datasets, Value in Health, 2023, 26(12): p. S532.
[46] Liu, Z., et al., Deid-gpt: Zero-shot medical text de-identification by gpt-4, arXiv preprint
arXiv:2303.11032, 2023.
[47] Stubbs, A., et al., Identifying risk factors for heart disease over time: Overview of 2014
i2b2/UTHealth shared task Track 2, J Biomed Inform, 2015, 58 Suppl(Suppl): p. S67-S77.
Appendix A: The Prompt Used to Obtain Benchmarks with GPT-4o

prompt = f"""You are tasked with extracting Protected Health Information (PHI) from clinical notes. Your job is to identify and mark specific entities within the text. Here are the entities you need to look for:

[... entity definitions omitted from this excerpt ...]

{clinical_note}

[...]

* Identify any text that matches one of the PHI entity types listed above.
* For each identified PHI entity, mark the beginning and end of the relevant text chunk using the following format:
BEGINER_LABEL CHUNK ENDNER, where LABEL is one of the entity types from the list and CHUNK is the actual text containing the PHI.
* While marking, DO NOT EDIT OR CHANGE the original clinical text, only put marks described above.

[...]

Original text:
Mrs. Linda Martinez, a 45-year-old architect, having MR #: 2775283 for an evaluation on 2023-05-10. Her insulin pump model ZX900 was assessed by Dr. Michael Brown, M.D. The patient's condition has improved since the 1990s, but she mentioned feeling unwell for past 6 months. MF381/1183 was referenced during her visit, which lasted approximately 5 hours and concluded at 10:05:03. She was discharged on 20/10/2023.

Marked text:
Mrs. BEGINER_PATIENT Linda Martinez ENDNER, a BEGINER_AGE 45 ENDNER year-old BEGINER_PROFESSION architect ENDNER, having MR #: BEGINER_MEDICALRECORD 2775283 ENDNER for an evaluation on BEGINER_DATE 2023-05-10 ENDNER. Her insulin pump model BEGINER_DEVICE ZX900 ENDNER was assessed by Dr. BEGINER_DOCTOR Michael Brown ENDNER, M.D. The patient's condition has improved since the BEGINER_DATE 1990s ENDNER, but she mentioned feeling unwell for past 6 months. BEGINER_IDNUM MF381/1183 ENDNER was referenced during her visit, which lasted approximately 5 hours and concluded at 10:05:03. She was discharged on BEGINER_DATE 20/10/2023 ENDNER.

Important notes:
* Be sure to process the entire clinical note and mark all instances of PHI entities.
* If a chunk of text could belong to multiple entity types, choose the most specific or appropriate one.
* Do not mark information that is not part of the specified PHI entity types.
* Preserve the original text exactly as it appears, including any spelling errors or formatting.
* Label the data, ensuring that professional titles or suffixes such as 'M.D.', 'Ph.D.', or similar are not removed. These titles must be preserved exactly as they appear in the text, without alteration or omission, and should NEVER be inside the label.
* Apostrophe 's' ('s) should not be included within the label when associated with Names. Only the person's name should be inside the label, and the apostrophe 's' should remain outside the marked text. However, apostrophe 's' is allowed within the DATE label when referring to a decade (e.g., 80's).
* Mark only specific calendar dates as DATE. Do not mark relative time expressions like "6 months," "1 year ago," "5 weeks," "5 wks," "yesterday," "today," "days," or similar units of time (months, years, weeks), as they do not represent actual dates.
* Mark only actual dates as DATE. Do not mark time-related expressions such as "10:05:03," "10 am," or durations like "5 hours" as DATE, since they refer to times or durations rather than specific calendar dates.
* Fax numbers should be treated as PHONE entities and marked the same way as phone numbers.
Please process the provided clinical note and return it with all PHI entities appropriately marked.
"""