DOI: 10.1002/hcs2.61

REVIEW

Large language models in health care: development, applications, and challenges

Rui Yang1 | Ting Fang Tan2 | Wei Lu3 | Arun James Thirunavukarasu4 | Daniel Shu Wei Ting2,5 | Nan Liu5,6

1 Department of Biomedical Informatics, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
2 Singapore National Eye Center, Singapore Eye Research Institute, Singapore Health Service, Singapore, Singapore
3 StatNLP Research Group, Singapore University of Technology and Design, Singapore
4 University of Cambridge School of Clinical Medicine, Cambridge, UK
5 Duke‐NUS Medical School, Centre for Quantitative Medicine, Singapore, Singapore
6 Duke‐NUS Medical School, Programme in Health Services and Systems Research, Singapore, Singapore

Correspondence
Nan Liu, Centre for Quantitative Medicine, Duke‐NUS Medical School, 8 College Road, Singapore 169857, Singapore.
Email: [email protected]

Funding information
None

Abstract
Recently, the emergence of ChatGPT, an artificial intelligence chatbot developed by OpenAI, has attracted significant attention due to its exceptional language comprehension and content generation capabilities, highlighting the immense potential of large language models (LLMs). LLMs have become a burgeoning hotspot across many fields, including health care. Within health care, LLMs may be classified into LLMs for the biomedical domain and LLMs for the clinical domain based on the corpora used for pre‐training. In the last 3 years, these domain‐specific LLMs have demonstrated exceptional performance on multiple natural language processing tasks, surpassing the performance of general LLMs as well. This not only emphasizes the significance of developing dedicated LLMs for specific domains, but also raises expectations for their applications in health care. We believe that LLMs may be used widely in preconsultation, diagnosis, and management, with appropriate development and supervision. Additionally, LLMs hold tremendous promise in assisting with medical education, medical writing, and other related applications. Likewise, health care systems must recognize and address the challenges posed by LLMs.

KEYWORDS
large language model, AI, health care
Abbreviations: AI, artificial intelligence; BERT, Bidirectional Encoder Representations from Transformers; BioBERT, Bidirectional Encoder Representations from Transformers for Biomedical Text Mining; CAD, computer‐aided diagnosis; EHR, electronic health records; GPT, Generative Pretrained Transformer; LLaMA, Large Language Model Meta AI; LLMs, large language models; NLP, natural language processing; PaLM, Pathways Language Model; PMC, PubMed Central; USMLE, United States Medical Licensing Examination.
Rui Yang and Ting Fang Tan are joint first authors.
Daniel Shu Wei Ting and Nan Liu are joint senior authors.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided
the original work is properly cited.
© 2023 The Authors. Health Care Science published by John Wiley & Sons Ltd on behalf of Tsinghua University Press.
FIGURE 1 Potential touch points along a patient's care journey for the application of large language models.
patients' symptoms, LLMs can improve predictions in the context of patients' medical background by harnessing additional information such as comorbidities, risk factors, and medication lists. Furthermore, clinicians may even uncover new insights or imaging biomarkers of disease in the process.

3.3 | Management

Woebot is a chatbot‐based conversational assistant that delivers cognitive behavior therapy services for adolescents with depression. In a randomized controlled trial, Woebot significantly reduced symptoms of depression compared with a control group given information‐only e‐book materials [35, 36]. SERMO is another conversational tool that guides patients with mental health conditions in regulating their emotions to better handle negative thoughts [37]. It automatically detects the type of emotion from user text inputs and recommends mindfulness activities or exercises tailored to the specific emotion.
Furthermore, LLMs have the potential to streamline administrative processes, increasing efficiency while reducing the administrative burden on physicians and enhancing patient experience. This can encompass drafting discharge summaries and operation reports, extracting succinct clinical information from EHR to complete medical reports and translating them into billable codes for reimbursement claims, as well as automating responses to general patient queries (e.g., requests for medication top‐ups, appointment booking, and rescheduling) [38–40].
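To make the drafting workflow concrete, below is a minimal sketch of LLM‐assisted generation of a discharge summary from structured encounter fields, assuming the OpenAI Python client; the model name, field names, and prompt wording are illustrative assumptions rather than a validated clinical pipeline, and any draft must still be reviewed by a physician.

```python
# Minimal sketch: drafting a discharge summary with an LLM.
# Assumptions: OpenAI Python client (openai>=1.0) with OPENAI_API_KEY set in
# the environment; the model name and encounter fields are illustrative only.
from openai import OpenAI

client = OpenAI()

encounter = {
    "diagnosis": "community-acquired pneumonia",
    "treatment": "IV ceftriaxone, stepped down to oral amoxicillin",
    "follow_up": "chest X-ray and clinic review in 6 weeks",
}

prompt = (
    "Draft a concise discharge summary for the encounter below, in plain "
    "language suitable for the patient's primary care team.\n\n"
    + "\n".join(f"{key}: {value}" for key, value in encounter.items())
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # draft only; needs clinician sign-off
```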
3.4 | Medical education and medical writing

In addition to health care applications from the patient perspective, LLMs hold immense potential in reshaping medical education and research. Existing LLMs have been able to pass undergraduate and postgraduate medical examinations [19, 41, 42]. Moreover, answers generated by ChatGPT to USMLE questions were accompanied by justifications that showed a high level of concordance and offered new insights [17, 31]. The logical flow of the explanations and the deductive reasoning, together with the additional supplementary information provided, allows students to follow and comprehend easily. For example, this can be targeted at an undergraduate medical student who answered a question incorrectly, to uncover new perspectives or remedial knowledge from the ChatGPT‐generated explanations. ChatGPT can also suggest innovative and unique mnemonics to aid memorization. The interactive interface of LLMs can complement existing student‐directed learning, where a Socratic style of teaching has been surveyed as preferable to didactic lectures by students [20, 43, 44].

LLMs can also add value to medical research. They can improve the efficiency of research article writing by automating tasks such as literature review, generating text, and guiding manuscript writing style and formatting [44]. Biswas recently published a perspective piece that was written by ChatGPT, though it still required editing by a human author [45, 46]. LLMs can also match patients to potential clinical trial opportunities relevant to their conditions and within inclusion and exclusion criteria. This can facilitate patient recruitment for research, while enabling access to potentially breakthrough treatments that may not otherwise be available or affordable for patients [45, 46].

4 | CHALLENGES OF LLMs IN HEALTH CARE

4.1 | Data privacy

One of the challenges in the validation and implementation of LLMs with real‐world clinical patient data is the risk of leaking confidential and sensitive patient information. For example, adversarial attacks on the LLM GPT‐2 successfully extracted the model's training data [47, 48]. By querying GPT‐2 with structured questions, training data including personally identifiable information and internet relay chat conversations were extracted verbatim. Moreover, despite anonymization of sensitive patient health information, some algorithms have demonstrated the capability to reidentify these patients [49–51]. To mitigate these challenges, possible strategies include pseudonymization or filtering of patient identifiers, differential privacy, and auditing of LLMs using data extraction attacks [47, 48, 52].
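As an illustration of the first of these strategies, below is a minimal sketch of rule‐based filtering of patient identifiers before a note is sent to an LLM; the regular expressions are toy assumptions, and production de‐identification pipelines [48, 52] are considerably more sophisticated.

```python
# Minimal sketch: rule-based pseudonymization of patient identifiers.
# The patterns below are illustrative assumptions, not a complete PHI list.
import re

PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),             # record numbers
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),       # numeric dates
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # phone numbers
}

def pseudonymize(note: str) -> str:
    """Replace each matched identifier with its category placeholder."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note

print(pseudonymize("Seen on 03/11/2023, MRN: 0048213, callback 555-013-2244."))
# -> Seen on [DATE], [MRN], callback [PHONE].
```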
4.2 | Questionable credibility and accuracy of information

Some have criticized LLMs for the questionable credibility and accuracy of the information they generate. Open‐domain, nonspecific LLMs risk perpetuating inaccurate information from open internet sources or generalizing poorly across different contextual settings [47, 48, 52, 53]. The term "hallucination effect" has been used to describe the trivial guessing behaviors observed in LLMs [54]. For example, an experiment using GPT‐3.5 to answer sample medical questions from the USMLE found that the model often predicted options A and D. In the ChatGPT‐generated perspective article, three fabricated citations were identified during editing by the human author [45]. This may be hazardous to users who are unable to discern seemingly credible but inaccurate or misleading answers. Despite their potential as an educational tool and source of information for patients, medical students, and the research community, human oversight and additional quality measures are essential to ensure the accuracy and quality control of the generated content.
4.3 | Data bias

LLMs are commonly trained on vast and diverse data, which are often biased. Consequently, the content generated by LLMs may perpetuate and even amplify biases related to ethnicity, gender, and socioeconomic background [55]. These biases are especially problematic in health care, where differential treatment may exacerbate disparities in mortality and morbidity. For example, a study focusing on skin cancer may predominantly involve fair‐skinned participants, resulting in an LLM that is less adept at identifying skin cancer in individuals with darker skin. This could lead to misdiagnosis and delayed or inappropriate treatment, further widening health disparities. The absence of minority groups in training data may lead LLMs to exacerbate these biases and produce inaccurate results. Moving towards fair artificial intelligence and combating bias will be a significant challenge for LLMs [56].
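The kind of subgroup disparity described above can be surfaced with a simple audit, sketched below; the labels, predictions, and skin‐tone tags are toy assumptions for illustration.

```python
# Minimal sketch: auditing model sensitivity across patient subgroups.
# Toy data: y_true are ground-truth skin-cancer labels, y_pred the model's
# predictions, and group a (hypothetical) skin-tone tag per patient.
from sklearn.metrics import recall_score

y_true = [1, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
group = ["fair", "fair", "fair", "fair", "dark", "dark", "dark", "dark"]

for g in ("fair", "dark"):
    idx = [i for i, tag in enumerate(group) if tag == g]
    sensitivity = recall_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx])
    print(f"sensitivity ({g}-skinned subgroup): {sensitivity:.2f}")
# A large gap between subgroups flags the kind of bias discussed in the text.
```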
4.4 | Interpretability of LLMs

The lack of interpretability of the decision‐making process of LLMs remains a barrier to adoption in clinical practice [57]. LLM‐generated responses are largely not accompanied by justifications or supporting information sources. This is further exacerbated by the tendency of LLMs to fabricate facts in a seemingly confident manner or to rely on trivial guessing, as elaborated above. In the context of safety‐critical tasks in health care, this may limit trust and acceptance by physicians and patients, for whom the consequences of inaccurate medical advice may be detrimental. Proposed methods to improve interpretability include the selection‐inference multistep reasoning framework of Creswell et al., which generates a series of causal reasoning steps toward the final generated response [58]. Another method proposed leveraging ChatGPT with chain‐of‐thought prompting (i.e., step‐by‐step instructions) [59] for knowledge graph extraction, where entities and relationships extracted from the raw input text are presented in a structured format that is then used to train an interpretable linear model for text classification [60]. Uncertainty‐aware LLM applications may be another useful feature, where differential weighting of input data or reporting of confidence scores for generated responses can enhance trust in proposed LLM applications [61].
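To illustrate the last of these ideas in spirit, below is a minimal sketch of fitting an interpretable linear classifier over entity–relation triples, loosely following ChatGraph [60]; the triples and labels are toy assumptions, and in the original method the triples would be extracted by prompting ChatGPT.

```python
# Minimal sketch: an interpretable linear classifier over knowledge triples.
# Each "document" is flattened into subject|relation|object tokens; in [60]
# these triples come from ChatGPT-based extraction, here they are toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "metformin|treats|diabetes insulin|lowers|glucose",
    "statin|treats|hyperlipidemia statin|lowers|cholesterol",
    "metformin|treats|diabetes glucose|elevated_in|diabetes",
    "statin|prevents|stroke cholesterol|elevated_in|hyperlipidemia",
]
labels = ["endocrine", "cardiovascular", "endocrine", "cardiovascular"]

vectorizer = CountVectorizer(token_pattern=r"\S+")  # one feature per triple
X = vectorizer.fit_transform(docs)
classifier = LogisticRegression().fit(X, labels)

# Each triple carries an inspectable weight, unlike an end-to-end LLM output.
for triple, weight in zip(vectorizer.get_feature_names_out(),
                          classifier.coef_[0]):
    print(f"{weight:+.2f}  {triple}")
```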
4.5 | Roles of LLMs

Another challenge LLMs may face lies in defining their role and identity in scientific research and clinical practice [62]. Questions that may arise include: Can AI be a researcher or a physician? Can AI be responsible for the content it generates? How can text generated by AI be distinguished from text written by humans? What should be done when physicians hold views that differ from those of AI? It is worth noting that LLMs may fabricate false content, so overuse must be avoided [55]. From preconsultation to diagnosis to treatment, and in medical education and medical research, LLMs can serve in complementary roles rather than as substitutes for physicians. Although LLMs can undergo self‐improvement, physician oversight is still required to ensure that the generated content is accurate and clinically relevant.

4.6 | Deployment of LLMs

LLaMA's open‐source release facilitates the deployment of LLMs on resource‐constrained devices, such as laptops, phones, and Raspberry Pi systems [5]. Alpaca's fine‐tuning based on LLaMA enables the rapid (within hours) and cost‐effective (under US$600) development of models that exhibit performance comparable to that of GPT‐3.5 [63]. This makes it possible to train personalized, high‐performing language models at a reduced cost, but it is important to recognize that these models also inherit various biases. When applied to general purposes, they may generate harmful or sensitive content, potentially compromising user security. Furthermore, the ease of deployment may increase the likelihood of LLMs being misused or even maliciously trained to disseminate deeply falsified information and detrimental content. Such outcomes could undermine public trust in AI and have deleterious effects on society as a whole. To ensure that LLMs are harnessed for their intended purposes and to reduce the risks associated with their misuse, it is crucial to develop and implement safeguards. These may include technical solutions for filtering out sensitive and harmful content, as well as the establishment of stringent terms of use and deployment specifications. By adopting such measures, the potential dangers of deploying LLMs on small personal devices can be effectively controlled.
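As an indication of how lightweight such adaptation has become, below is a minimal sketch of Alpaca‐style parameter‐efficient (LoRA) fine‐tuning using the Hugging Face transformers and peft libraries; the checkpoint name and hyperparameters are illustrative assumptions, not the published Alpaca recipe, and the training loop itself is omitted.

```python
# Minimal sketch: wrapping a LLaMA-style base model with LoRA adapters so
# that only a small fraction of the weights is trained. The checkpoint name
# and hyperparameters are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "huggyllama/llama-7b"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```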
4.7 | Clinical domain‐specific LLMs

There is no doubt that LLMs are having a significant impact on the health care field. Whether in preconsultation, diagnosis, management, medical education, or medical writing, all of these areas will undergo transformative changes driven by the development of LLMs. In this regard, it is essential to recognize that when LLMs are deployed in real clinical settings, different medical specialties will encounter a variety of unique challenges [64]. For example, the type and quality of data may differ significantly between domains. Additionally, the diverse application scenarios and tasks of LLMs will lead to inconsistencies in the standards expected by clinical professionals. In light of this, when deploying LLMs in clinical environments, we should recognize the variations across clinical specialties and make appropriate adjustments for the specific application scenarios.

5 | CONCLUSION

LLMs are poised to bring about significant transformation in health care and will become ubiquitous in this field. To make LLMs more serviceable for health care, training from scratch on medical databases or fine‐tuning generic LLMs would be effective approaches. Moreover, LLMs can perform multimodal feature fusion with diverse data sources, including image data and tabular data, resulting in better performance, potentially even beyond human level. While the use of LLMs presents numerous benefits, we should recognize that LLMs cannot take full responsibility for the content they generate. It is essential to ensure that AI‐generated content is properly reviewed to avoid potential harm. As the threshold for deploying LLMs diminishes, improving deployment specifications also deserves attention. Simultaneously, efforts should be made to promote the integration of LLMs into clinical practice, improve their interpretability in clinical settings, and enhance human‐machine collaboration to better support clinical decision‐making. By leveraging LLMs as a complementary tool, physicians can maximize the benefits of AI while mitigating potential risks, achieving better clinical outcomes for patients. Ultimately, the successful integration of LLMs into health care will require the collaborative efforts of physicians, data scientists, administrators, patients, and regulatory bodies.
AUTHOR CONTRIBUTIONS
Rui Yang: Conceptualization (equal); writing—original draft (equal); writing—review and editing (equal). Ting Fang Tan: Conceptualization (equal); writing—original draft (equal); writing—review and editing (equal). Wei Lu: Conceptualization (equal); writing—review and editing (equal). Arun James Thirunavukarasu: Conceptualization (equal); writing—review and editing (equal). Daniel Shu Wei Ting: Conceptualization (equal); supervision (lead); writing—review and editing (equal). Nan Liu: Conceptualization (lead); project administration (lead); supervision (lead); writing—original draft (supporting); writing—review and editing (equal).

ACKNOWLEDGMENTS
Not applicable.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
Not applicable.

ETHICS STATEMENT
Not applicable.

INFORMED CONSENT
Not applicable.

ORCID
Arun James Thirunavukarasu http://orcid.org/0000-0001-8968-4768
Nan Liu https://orcid.org/0000-0003-3610-4883

REFERENCES
1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:1–11. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. Devlin J, Chang M‐W, Lee K, Toutanova K. BERT: pre‐training of deep bidirectional transformers for language understanding. 2018. https://doi.org/10.48550/arXiv.1810.04805
3. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few‐shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901. https://doi.org/10.48550/arXiv.2005.14165
4. Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, et al. PaLM: scaling language modeling with pathways. arXiv:2204.02311. 2022. http://arxiv.org/abs/2204.02311
5. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M‐A, Lacroix T, et al. LLaMA: open and efficient foundation language models. 2023. http://arxiv.org/abs/2302.13971
6. OpenAI. GPT‐4 technical report. 2023. http://arxiv.org/abs/2303.08774
7. Amatriain X. Transformer models: an introduction and catalog. 2023. http://arxiv.org/abs/2302.07730
8. Kaplan J, McCandlish S, Henighan T, Brown TB, Chess B, Child R, et al. Scaling laws for neural language models. 2020. http://arxiv.org/abs/2001.08361
9. Zhavoronkov A. Caution with AI‐generated content in biomedicine. Nature Med. 2023;29(3):532. https://doi.org/10.1038/d41591-023-00014-w
10. He Y, Zhu Z, Zhang Y, Chen Q, Caverlee J. Infusing disease knowledge into BERT for health question answering, medical inference and disease name recognition. 2020. https://doi.org/10.48550/arXiv.2010.03746
11. Li C, Zhang Y, Weng Y, Wang B, Li Z. Natural language processing applications for computer‐aided diagnosis in oncology. Diagnostics. 2023;13(2):286. https://doi.org/10.3390/diagnostics13020286
12. Omoregbe NAI, Ndaman IO, Misra S, Abayomi‐Alli OO, Damaševičius R. Text messaging‐based medical diagnosis using natural language processing and fuzzy logic. J Healthc Eng. 2020;2020(4):1–14. https://doi.org/10.1155/2020/8839524
13. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre‐trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40. https://doi.org/10.1093/bioinformatics/btz682
14. Alsentzer E, Murphy JR, Boag W, Weng W‐H, Jin D, Naumann T, et al. Publicly available clinical BERT embeddings. 2019. https://doi.org/10.48550/arXiv.1904.03323
15. Beltagy I, Lo K, Cohan A. SciBERT: a pretrained language model for scientific text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP‐IJCNLP). 2019. https://doi.org/10.18653/v1/d19-1371
16. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain‐specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthcare. 2022;3(1):1–23. https://doi.org/10.1145/3458754
17. Wang J, Zhang G, Wang W, Zhang K, Sheng Y. Cloud‐based intelligent self‐diagnosis and department recommendation service using Chinese medical BERT. J Cloud Comput. 2021;10:1–12. https://doi.org/10.1186/s13677-020-00218-2
18. Shen Y, Heacock L, Elias J, Hentel KD, Reig B, Shih G, et al. ChatGPT and other large language models are double‐edged swords. Radiology. 2023;307(2):230163. https://doi.org/10.1148/radiol.230163
19. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI‐assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. https://doi.org/10.1371/journal.pdig.0000198
20. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. https://doi.org/10.2196/45312
21. Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology. 2023;307(2):230171. https://doi.org/10.1148/radiol.230171
22. Thirunavukarasu A, Hassan R, Mahmood S, Sanghera R, Barzangi K, El Mukashfi M, et al. Trialling a large language model (ChatGPT) with Applied Knowledge Test questions: what are the opportunities and limitations of artificial intelligence chatbots in primary care? (Preprint). 2023. https://doi.org/10.2196/preprints.46599
23. Lei L, Liu D. A new medical academic word list: a corpus‐based study with enhanced methodology. J English Acad Purp. 2016;22:42–53. https://doi.org/10.1016/j.jeap.2016.01.008
24. Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, et al. MIMIC‐III, a freely accessible critical care database. Sci Data. 2016;3:160035. https://doi.org/10.1038/sdata.2016.35
25. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. npj Digit Med. 2022;5(1):1–9. https://doi.org/10.1038/s41746-022-00742-2
26. Med‐PaLM [Internet]. Available from: https://sites.research.google/med-palm/
27. Matias Y. Our latest health AI research updates. Google [Internet]. Available from: https://blog.google/technology/health/ai-llm-medpalm-research-thecheckup/
28. Li Y, Li Z, Zhang K, Dan R, Zhang Y. ChatDoctor: a medical chat model fine‐tuned on LLaMA model using medical domain knowledge. 2023. http://arxiv.org/abs/2303.14070
29. Xu C, Guo D, Duan N, McAuley J. Baize: an open‐source chat model with parameter‐efficient tuning on self‐chat data. 2023. http://arxiv.org/abs/2304.01196
30. Ben Abacha A, Demner‐Fushman D. A question‐entailment approach to question answering. BMC Bioinformatics. 2019;20(1):511. https://doi.org/10.1186/s12859-019-3119-4
31. World Health Organization. WHO global strategy on people‐centred and integrated health services: interim report. World Health Organization; 2015. https://apps.who.int/iris/handle/10665/155002
32. Kenneth Leung on LinkedIn [Internet]. Available from: https://www.linkedin.com/posts/kennethleungty_generativeai-ai-pharmacist-activity-7031533843429949440-pVZb
33. Bala S, Keniston A, Burden M. Patient perception of plain‐language medical notes generated using artificial intelligence software: pilot mixed‐methods study. JMIR Form Res. 2020;4(6):e16670. https://doi.org/10.2196/16670
34. Van H, Kauchak D, Leroy G. AutoMeTS: the autocomplete for medical text simplification. 2020. https://doi.org/10.48550/arXiv.2010.10573
35. Vaidyam AN, Wisniewski H, Halamka JD, Kashavan MS, Torous JB. Chatbots and conversational agents in mental health: a review of the psychiatric landscape. Can J Psychiatry. 2019;64(7):456–64. https://doi.org/10.1177/0706743719828977
36. Fitzpatrick KK, Darcy A, Vierhile M. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): a randomized controlled trial. JMIR Ment Health. 2017;4(2):e19. https://doi.org/10.2196/mental.7785
37. Denecke K, Vaaheesan S, Arulnathan A. A mental health chatbot for regulating emotions (SERMO)—concept and usability test. IEEE Trans Emerg Topics Comput. 2021;9:1170–82. https://doi.org/10.1109/tetc.2020.2974478
38. Singh S, Djalilian A, Ali MJ. ChatGPT and ophthalmology: exploring its potential with discharge summaries and operative notes. Semin Ophthalmol. 2023;38(5):503–7. https://doi.org/10.1080/08820538.2023.2209166
39. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023;5(3):e107–8. https://doi.org/10.1016/S2589-7500(23)00021-3
40. CB Insights. How artificial intelligence is reshaping medical billing & insurance. CB Insights Research [Internet]. Available from: https://www.cbinsights.com/research/artificial-intelligence-healthcare-providers-medical-billing-insurance/
41. Varanasi L. AI models like ChatGPT and GPT‐4 are acing everything from the bar exam to AP Biology. Here's a list of difficult exams both AI versions have passed. 2023. Available from: https://www.businessinsider.com/list-here-are-the-exams-chatgpt-has-passed-so-far-2023-1
42. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. 2022. https://doi.org/10.48550/arxiv.2212.13138
43. Burk‐Rafel J, Santen SA, Purkiss J. Study behaviors and USMLE Step 1 performance: implications of a student self‐directed parallel curriculum. Acad Med. 2017;92:S67–74. https://doi.org/10.1097/ACM.0000000000001916
44. Abou‐Hanna JJ, Owens ST, Kinnucan JA, Mian SI, Kolars JC. Resuscitating the Socratic method: student and faculty perspectives on posing probing questions during clinical teaching. Acad Med. 2021;96(1):113–7. https://doi.org/10.1097/ACM.0000000000003580
45. Biswas S. ChatGPT and the future of medical writing. Radiology. 2023;307(2):223312. https://doi.org/10.1148/radiol.223312
46. BuildGreatProducts.club. The potential of large language models (LLMs) in healthcare: improving quality of care and patient outcomes. Medium [Internet]. Available from: https://medium.com/@BuildGP/the-potential-of-large-language-models-in-healthcare-improving-quality-of-care-and-patient-6e8b6262d5ca
47. Carlini N, Tramer F, Wallace E, Jagielski M, Herbert‐Voss A, Lee K, et al. Extracting training data from large language models. 2020. https://doi.org/10.48550/arXiv.2012.07805
48. Yang X, Lyu T, Li Q, Lee C‐Y, Bian J, Hogan WR, et al. A study of deep learning methods for de‐identification of clinical notes in cross‐institute settings. BMC Med Inform Decis Mak. 2019;19(Suppl 5):232. https://doi.org/10.1186/s12911-019-0935-4
49. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science. 2013;339(6117):321–4. https://doi.org/10.1126/science.1229566
50. Na L, Yang C, Lo C‐C, Zhao F, Fukuoka Y, Aswani A. Feasibility of reidentifying individuals in large national physical activity data sets from which protected health information has been removed with use of machine learning. JAMA Netw Open. 2018;1(8):e186040. https://doi.org/10.1001/jamanetworkopen.2018.6040
51. Erlich Y, Shor T, Pe'er I, Carmi S. Identity inference of genomic data using long‐range familial searches. Science. 2018;362(6415):690–4. https://doi.org/10.1126/science.aau4832
52. Du L, Xia C, Deng Z, Lu G, Xia S, Ma J. A machine learning based approach to identify protected health information in Chinese clinical text. Int J Med Inform. 2018;116:24–32. https://doi.org/10.1016/j.ijmedinf.2018.05.010
53. McDermott MBA, Wang S, Marinsek N, Ranganath R, Foschini L, Ghassemi M. Reproducibility in machine learning for health research: still a ways to go. Sci Transl Med. 2021;13(586):eabb1655. https://doi.org/10.1126/scitranslmed.abb1655
54. OpenAI. ChatGPT: optimizing language models for dialogue. OpenAI [Internet]. Available from: https://openai.com/blog/chatgpt/
55. Volovici V, Syn NL, Ercole A, Zhao JJ, Liu N. Steps to avoid overuse and misuse of machine learning in clinical research. Nature Med. 2022;28(10):1996–9. https://doi.org/10.1038/s41591-022-01961-6
56. Norori N, Hu Q, Aellen FM, Faraci FD, Tzovara A. Addressing bias in big data and AI for health care: a call for open science. Patterns. 2021;2(10):100347. https://doi.org/10.1016/j.patter.2021.100347
57. Tjoa E, Guan C. A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Trans Neural Netw Learn Syst. 2021;32(11):4793–813. https://doi.org/10.1109/TNNLS.2020.3027314
58. Creswell A, Shanahan M, Higgins I. Selection‐Inference: exploiting large language models for interpretable logical reasoning. 2022. http://arxiv.org/abs/2205.09712
59. Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, et al. Chain‐of‐thought prompting elicits reasoning in large language models. 2022. http://arxiv.org/abs/2201.11903
60. Shi Y, Ma H, Zhong W, Mai G, Li X, Liu T, et al. ChatGraph: interpretable text classification by converting ChatGPT knowledge to graphs. 2023. http://arxiv.org/abs/2305.03513
61. Youssef A, Abramoff M, Char D. Is the algorithm good in a bad world, or has it learned to be bad? The ethical challenges of "locked" versus "continuously learning" and "autonomous" versus "assistive" AI tools in healthcare. Am J Bioeth. 2023;23(5):43–5. https://doi.org/10.1080/15265161.2023.2191052
62. Liebrenz M, Schleifer R, Buadze A, Bhugra D, Smith A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. Lancet Digit Health. 2023;5(3):e105–6. https://doi.org/10.1016/S2589-7500(23)00019-5
63. Stanford CRFM. Alpaca: a strong, replicable instruction‐following model. Available from: https://crfm.stanford.edu/2023/03/13/alpaca.html
64. Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J Med Syst. 2023;47(1):33. https://doi.org/10.1007/s10916-023-01925-4

How to cite this article: Yang R, Tan TF, Lu W, Thirunavukarasu AJ, Ting DSW, Liu N. Large language models in health care: development, applications, and challenges. Health Care Sci. 2023;2:255–263. https://doi.org/10.1002/hcs2.61