
Intelligence-Based Medicine 8 (2023) 100114


Decoding ChatGPT: A primer on large language models for clinicians

Abstract

The rapid progress of artificial intelligence (AI) and the adoption of Large Language Models (LLMs) suggest that these technologies will transform healthcare in the coming years. We present a primer on LLMs for clinicians, focusing as a use case on OpenAI's Generative Pretrained Transformer-4 (GPT-4) model, which powers ChatGPT and has already seen record-breaking uptake. ChatGPT generates natural-sounding text based on patterns observed in vast amounts of training data. The core strengths of ChatGPT and LLMs in healthcare applications include summarization and text generation, rapid adaptation and learning, and ease of customization and integration into existing applications. However, clinicians should also recognize the limitations of LLMs, most notably concerns about inaccuracy, privacy, accountability, transparency, and explainability. Clinicians must embrace the opportunity to explore, engage, and lead in the responsible integration of LLMs, harnessing their potential to revolutionize patient care and drive advancements in an ever-evolving healthcare landscape.

1. Introduction

Over the past year, artificial intelligence (AI) has achieved remarkable progress, with Large Language Models (LLMs) like ChatGPT achieving impressive performance and adoption across multiple industries. As the integration of these technologies into the medical field becomes increasingly imminent, it is crucial for physicians, regardless of their technical expertise, to develop a fundamental understanding of how these models function. In contrast to the growing literature exploring potential applications of LLMs, we aim for this editorial to serve as a foundational and accessible primer on LLMs at a level of abstraction relevant to a clinical audience with or without formal training in computer science or machine learning. Our goal is to establish a common vocabulary and understanding that facilitates clear communication and collaboration between clinicians, AI researchers, and other stakeholders in the medical community.

Key terms for this discussion:

• Deep learning, a type of machine learning, uses artificial neural networks (multilayered networks of linked nodes that loosely resemble how human neurons work) to allow systems to learn and make decisions based on unstructured and unlabeled data. Machine learning trains AI systems to learn from inputs, recognize patterns, and make recommendations. With deep learning specifically, instead of merely responding to sets of rules, digital systems build knowledge from examples and then use that knowledge to react, behave, and perform in a more human-like manner [1].
• Natural Language Processing (NLP) refers to the branch of AI concerned with giving computers the ability to understand text and spoken words in a human-like manner. It does this by combining rule-based modeling of human language (computational linguistics) with statistical, machine learning, and deep learning models. These technologies allow computers to "understand" language and analyze its meaning, intent, and sentiment [2,3].
• A large language model is a neural network-based language model that contains billions to trillions of parameters and is trained on vast quantities of text data. LLMs can generate natural language text and perform language-related tasks such as text completion, summarization, and conversational dialogue [4].
• Transformer architecture, introduced in 2017, is a type of neural network architecture designed to handle sequential input data [5]. These models are not restricted to processing that data in sequential order, however. Instead, they use attention, a technique that allows models to assign different levels of influence to different pieces of input data and to identify the context for individual pieces of data in an input sequence (a minimal numerical sketch follows this list). This allows transformers to manage long-range dependencies, the relationships or connections between distant words or elements in a sequence. Transformers form the basis for numerous state-of-the-art language models, including LLMs [1].
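To make the attention idea concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention over random stand-in embeddings; real transformers add learned query, key, and value projections, masking, and many attention heads.

```python
# A minimal sketch of the attention mechanism described above, using
# plain NumPy. The embeddings are random stand-ins for token vectors.
import numpy as np

rng = np.random.default_rng(0)

seq_len, dim = 4, 8                      # 4 tokens, 8-dimensional embeddings
x = rng.normal(size=(seq_len, dim))      # stand-in token embeddings

q, k, v = x, x, x                        # toy case: queries = keys = values
scores = q @ k.T / np.sqrt(dim)          # similarity between all token pairs
scores -= scores.max(axis=1, keepdims=True)  # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row

# Each output row is a context-aware mixture of all token embeddings,
# which is how distant tokens influence one another (long-range
# dependencies).
output = weights @ v
print(weights.round(2))
```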
Several prominent LLMs have emerged recently, including Google's PaLM 2, which now powers the Bard chatbot, and OpenAI's Generative Pretrained Transformer model (GPT-4, the most recent iteration, released March 14, 2023), which powers Bing Chat and the paid version of ChatGPT. We will focus our review of LLMs on GPT-4 and ChatGPT, as ChatGPT was the first LLM to provide a widely adopted, public-facing interactive platform. OpenAI has also introduced the GPT-4 and ChatGPT Application Programming Interfaces (APIs), tools that enable the integration of these models into existing applications and make them a popular choice for further development and cross-platform implementation. Finally, ChatGPT has experienced a record-breaking level of uptake, reaching one million users in just five days [6] and growing to 100 million users within two months, setting the record for the fastest-growing user base [7].

https://doi.org/10.1016/j.ibmed.2023.100114
Received 5 July 2023; Received in revised form 11 October 2023; Accepted 11 October 2023
Available online 12 October 2023
2666-5212/© 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

2. Mechanics of LLMs

GPT models can generate natural-sounding text by continually selecting words (or tokens) that form a "reasonable continuation" of the input text provided. Text choice is based on patterns observed in vast amounts of training data, likely trillions of English words for the most recent GPT-4 model (note: GPT-4, and therefore ChatGPT, has only been trained on text data from before September 2021) [8]. When ChatGPT analyzes text, it is not looking at full words in the way humans might, but at tokens, which are small units of meaningful text (words, punctuation, or parts of words) used to represent language [9].
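To see tokenization in action, the sketch below uses OpenAI's open-source tiktoken library (installable with pip) to split a sentence into tokens and map each token back to its text fragment; the example sentence is ours, and the exact token boundaries depend on the encoding used.

```python
# A minimal sketch of GPT-style tokenization using OpenAI's
# open-source tiktoken library (pip install tiktoken).
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "The patient presented with acute hypoxemia."
token_ids = enc.encode(text)

# Each integer ID maps back to a small unit of text; note that a rare
# clinical term may be split across several tokens.
print(token_ids)
print([enc.decode([t]) for t in token_ids])
```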
Parameters are adjustable numerical values that define the model's architecture and determine how it processes these tokens and generates text. They capture relationships and weigh dependencies between tokens, enabling the model to understand and predict the likelihood of a token appearing in a given context. Parameters are learned during training by testing the model and minimizing the error between the model's outputs and the actual language patterns in the training data. The more parameters a model has, the more complex the patterns and nuances it can capture, resulting in better language generation. GPT-4, for example, likely contains over a trillion parameters, far more than the number of tokens in human language, which allows it to incorporate nuanced relationships and generate high-quality text. As such, the term "Large" in Large Language Model generally refers to the number of parameters present [10].
ChatGPT determines the probabilities of the different tokens that could logically follow a given text segment and selects one of them to continue the sentence. To avoid monotonous and repetitive text, ChatGPT sometimes chooses a token that is not the highest-ranked (most probable) token but is still plausible. This random introduction of a less common token, controlled by a parameter called temperature, allows the model to exhibit creativity, produce interesting text, and respond differently when given the same input (Fig. 1) [11].
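To make the role of temperature concrete, here is a toy sketch of temperature-scaled sampling over an invented next-token distribution; the candidate tokens and scores are fabricated for illustration and are not drawn from any real model.

```python
# A minimal sketch of temperature-scaled next-token sampling.
# The candidate tokens and logits below are invented for illustration.
import numpy as np

rng = np.random.default_rng()

tokens = ["fever", "cough", "fatigue", "rash"]
logits = np.array([2.0, 1.5, 1.0, 0.2])  # raw model scores

def sample_token(logits, temperature):
    # Lower temperature sharpens the distribution (more deterministic);
    # higher temperature flattens it (more varied, "creative" output).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(tokens, p=probs)

print(sample_token(logits, temperature=0.2))  # almost always "fever"
print(sample_token(logits, temperature=1.5))  # noticeably more varied
```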
To underscore a critical point: while ChatGPT generates confident and plausible-sounding responses, the model does not understand the content it generates in any meaningful sense; instead, it is trained to mimic human language patterns based on statistical relationships [11].

In addition to the core GPT framework, ChatGPT employs an additional training phase called Reinforcement Learning from Human Feedback (RLHF). This method incorporates human-generated examples and human evaluation of AI-created prompts and responses to establish a reward function, which guides the GPT model towards producing responses that better align with human preferences (a toy illustration follows below). Clinicians and subject matter experts should be involved in the RLHF process, as their expertise ensures the development of a more accurate and reliable model in the context of medical use cases. As a result of this process, ChatGPT can generate more contextually relevant and accurate information, making it particularly useful in the medical field, where nuanced and precise communication is crucial [12].
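The toy sketch below illustrates only the role of the reward function: scoring candidate responses and preferring the better-aligned one. Real RLHF learns a reward model from human preference data and uses it to update the LLM's weights through reinforcement learning; the heuristic scoring here is purely illustrative.

```python
# A toy illustration of the role of a reward function in RLHF.
# Real RLHF trains a reward model on human preference rankings and
# then updates the LLM with reinforcement learning; here we only show
# a stand-in reward function ranking candidate responses.

def toy_reward(response: str) -> float:
    # Stand-in for a learned reward model: prefer responses that hedge
    # appropriately and advise consulting a clinician.
    score = 0.0
    if "consult" in response.lower():
        score += 1.0
    if "definitely" in response.lower():
        score -= 1.0  # penalize overconfident phrasing
    return score

candidates = [
    "This is definitely pneumonia.",
    "These symptoms could have several causes; please consult your physician.",
]

best = max(candidates, key=toy_reward)
print(best)  # the hedged, consult-recommending response wins
```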

Fig. 1. Stephen Wolfram's illustration of ChatGPT's text generation process [11]. *For this example, we use the term "words" for simplicity and clarity, but in reality ChatGPT utilizes tokens, or small units of meaningful text.


3. Core strengths of LLMs in healthcare applications

LLMs such as ChatGPT are poised to impact every domain of healthcare delivery, with early examples of their use in medical education, clinical decision support, and interpretation of the medical literature [13–16]. Dedicated studies of ChatGPT for medical applications are sparse, but early feasibility studies testing ChatGPT's ability to answer patient preventive healthcare questions, draft discharge summaries, or provide clinical decision support have shown promising results [17–20]. We note that these studies evaluated the older version of ChatGPT, which is built on GPT-3.5, leaving a gap in understanding the newest GPT-4-powered model's capabilities [14,17,18,20,21]. GPT-4's impressive performance gains on relevant standardized tests, a practical method for inferring capabilities between models in a domain, suggest that it may provide a critical leap in capabilities relevant to healthcare applications. For example, average performance across the three steps of the US Medical Licensing Exam (USMLE) increased from 49 % of questions answered correctly with GPT-3.5 to 87 % correct for GPT-4, a transition from poor performance to far exceeding the usual threshold for passing these tests [22]. We anticipate that future work will continue to evaluate this newer model and demonstrate its strengths across a wide variety of medical tasks. Clinicians should understand the key strengths that make these models an invaluable asset for enhancing healthcare practice:

1. Summarization and Text Generation: GPT-4 is a decoder model, which means that at each stage, for a given word, the attention layers of the model can only access the words positioned before it in the sentence [3]. This attribute makes it excellent for language modeling, text generation, and summarization. It is no surprise that it has the potential to revolutionize interaction with electronic health records. Epic has already announced plans to integrate large language models, which could save time for clinicians through the generation of advanced text notes, concise summarization of medical care history, faster information retrieval, and more [23]. It could also streamline revenue cycles by simplifying prior authorizations, speeding up coding and charging, and identifying potential errors that lead to denials before claims are submitted.
2. Rapid adaptation and learning: ChatGPT can quickly adapt to new information and learn from user interactions, enabling it to improve its performance over time. One way ChatGPT achieves rapid adaptation is through few-shot learning, where the general model learns to perform a task with only a small number of provided examples [24] (see the first sketch following this list). Fine-tuning is the process of adjusting the model's pre-trained weights using a smaller, domain-specific dataset; once fine-tuned, examples no longer need to be provided to the model to obtain the desired outputs. Because ChatGPT has been trained on vast amounts of data, it can often be fine-tuned with significantly less data than prior models, on the order of several hundred examples, which makes it a more versatile and adaptable tool in healthcare.
3. Customization and integration: The GPT architecture and ChatGPT can be integrated into existing applications and programs with relative ease (see the second sketch following this list). This adaptability allows healthcare organizations to leverage the power of LLMs in a manner tailored to their unique needs and workflows. A notable example of this integration came with Nuance announcing the Dragon Ambient eXperience Express, a fully automated clinical documentation application employing GPT-4 that aims to generate full clinical notes based on passive listening to patient-clinician conversations [23,25]. With the ability to integrate with electronic health records, telemedicine platforms, and various other healthcare-related software, ChatGPT and other LLMs can enhance collaboration and communication within the medical community, streamline processes, and improve patient care.
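As promised above, the first sketch illustrates few-shot prompting: a prompt is assembled from two worked examples before the new case, so the model can infer the task from the pattern. The shorthand-expansion task and all examples are invented for illustration, and no model call is made.

```python
# A minimal sketch of few-shot prompting: the model is shown a few
# worked examples in the prompt itself, then asked to continue the
# pattern. The task and examples here are invented for illustration.
examples = [
    ("Pt c/o SOB on exertion x3 days.",
     "Patient complains of shortness of breath on exertion for three days."),
    ("Hx of HTN, DM2; meds reconciled.",
     "History of hypertension and type 2 diabetes; medications reconciled."),
]

new_note = "Pt denies CP, palpitations; amb without assist."

prompt_parts = ["Expand the clinical shorthand into plain English.\n"]
for shorthand, expansion in examples:
    prompt_parts.append(f"Shorthand: {shorthand}\nExpansion: {expansion}\n")
prompt_parts.append(f"Shorthand: {new_note}\nExpansion:")

prompt = "\n".join(prompt_parts)
print(prompt)  # this string would be sent to the model as its input
```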
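The second sketch outlines API-based integration using the openai Python package's 2023-era chat completions interface; the API key, system prompt, and note text are placeholders, and any real clinical deployment would require institutional approval and appropriate privacy safeguards.

```python
# A minimal sketch of integrating a GPT model into an application via
# the OpenAI API (openai Python package, 2023-era interface).
# The system prompt and note text are placeholders for illustration.
import openai

openai.api_key = "YOUR_API_KEY"  # supplied by your OpenAI account

response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0.2,  # low temperature for more consistent clinical text
    messages=[
        {"role": "system",
         "content": "You summarize clinical notes for clinician review."},
        {"role": "user",
         "content": "Summarize: 62yo M admitted with community-acquired "
                    "pneumonia, improving on ceftriaxone/azithromycin."},
    ],
)

print(response["choices"][0]["message"]["content"])
```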
4. Core weaknesses of LLMs in healthcare applications

As new technologies make their way into clinical practice, evaluating the potential risks and real-world consequences for patients is crucial. Clinicians seeking to incorporate these tools into their daily workflow, or discussing them with patients and families, must be aware of the key limitations and barriers associated with using such models in clinical medicine. Most of these models, including ChatGPT, pose challenges for assessing these risks, as evaluations are primarily limited to examining model outputs (generated responses) rather than conducting a comprehensive analysis of the model architecture, training data, and other factors. Here we outline three key limitations of these models that are relevant to clinicians:

1. Accuracy: Internal factual accuracy evaluation by OpenAI was only 81 % for GPT-4 when discussing scientific topics, up from 62 % with GPT-3.5 [8]. Few clinical studies have evaluated the accuracy of ChatGPT, and to our knowledge, none have yet evaluated the model powered by GPT-4. Sarraju et al. tested ChatGPT's (GPT-3.5) responses to simple preventive cardiology questions and found 84 % accuracy with 100 % reliability (i.e., answer meaning was consistent with repeated prompting) [17]. Ayers et al. recently tested ChatGPT's responses to online questions from a public social media forum (Reddit's r/AskDocs) and found that ChatGPT's responses to medical questions were more likely than physician responses to be rated Good or Very Good in quality (79 % compared to 22 %) and more likely to be rated Empathetic or Very Empathetic (45 % compared to 5 %) [20]. Although these data are promising, ChatGPT is also capable of "hallucinations," in which the model produces fabricated data or information in response to a user's query, quite often in a confident, coherent, and very believable way. A notable example of ChatGPT's tendency to hallucinate is that it will often fabricate very plausible but invalid references when prompted to justify where a fact or response came from. The frequency of hallucinations is unknown in general and will vary between medical and non-medical contexts.
2. Privacy and Accountability: GPT-4 has been trained on trillions of words scraped from various online sources, some of which may contain personal information obtained without consent, posing a potential violation of privacy standards. When using ChatGPT, OpenAI has previously stored user data automatically, raising concerns about privacy and security. In April 2023, a data breach affected less than 1 % of users' data, highlighting the risks associated with transmitting sensitive data to ChatGPT [26]. It is important to note that when using the GPT API (not ChatGPT), OpenAI stores user data for 30 days before permanently deleting it, and only uses the data for model training if the user opts in [27]. Furthermore, when users interact with ChatGPT, they may inadvertently share sensitive information; for example, a medical professional may use ChatGPT to review a patient's health record, inadvertently exposing confidential patient information to the system (see the redaction sketch following this list). We note that these policies have been in flux over the last several months and differ depending on the specific LLM, so users must be vigilant regarding the exact terms and conditions of the tool being used.
3. Transparency and Explainability: While these models can generate remarkable text that appears to exhibit sound reasoning, the process by which they curate and select information is not always apparent. Although LLMs can provide step-by-step reasoning in natural language, there is still no definitive method for interpreting the inner workings of LLMs due to their complexity, making it challenging to discern the types of knowledge, reasoning, or goals employed by the model when generating certain outputs [28]. Furthermore, unlike predictive models trained for specific outcomes, which can be assessed for performance despite their black-box nature, the outputs of LLMs lack objective measures of accuracy when engaged in general dialogue. This absence of clear evaluation criteria may hinder the adoption of performance-based assessments in favor of transparent processes that elucidate the model's decision-making for a given output.
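As a partial mitigation for the privacy risk described in the list above, the sketch below scrubs a few obvious identifiers from text before it would be sent to an external service; these regular expressions are illustrative only and fall well short of validated HIPAA de-identification.

```python
# A toy sketch of scrubbing obvious identifiers from text before
# sending it to an external LLM service. The patterns below are
# illustrative only; real de-identification of protected health
# information requires validated tooling and institutional review.
import re

REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # US SSN format
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),   # simple dates
    (re.compile(r"\bMRN[:\s]*\d+\b"), "[MRN]"),               # record numbers
]

def redact(text: str) -> str:
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text

note = "Seen 4/12/2023, MRN: 884301, SSN 123-45-6789, stable on ACEi."
print(redact(note))  # Seen [DATE], [MRN], SSN [SSN], stable on ACEi.
```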
5. Final thoughts and encouraging collaboration: A path forward for clinicians

In conclusion, integrating LLMs like ChatGPT into healthcare offers immense potential to enhance care delivery and alleviate administrative burdens. LLMs can efficiently analyze and summarize large volumes of language data while exhibiting great potential for rapid adaptability and customization, making them invaluable assets for streamlining healthcare delivery. However, clinicians must remain vigilant about the possible risks, such as inaccurate responses and hallucinations, significant privacy concerns, and a lack of transparency in understanding model training and behavior. By proactively engaging with these emerging technologies and fostering a robust knowledge base, healthcare professionals can work collaboratively towards the safe and effective implementation of LLMs, ultimately fostering better patient care and driving innovation within the healthcare landscape.

During the preparation of this work, the author(s) used ChatGPT during early-stage outlining and for concept generation. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.


Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: A. Chang is a senior editor for Intelligence-Based Medicine, is founder and medical director of the Medical Intelligence and Innovation Institute, and has founded CardioGenomic Intelligence, Artificial Intelligence in Medicine (AIMed), and Medical Intelligence 10 (MI10). A. Limon serves as a principal consultant at Oneirix Labs.

References

[1] What is deep learning? Microsoft Azure. https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-deep-learning/. [Accessed 2 May 2023].
[2] What is natural language processing? IBM. https://www.ibm.com/topics/natural-language-processing. [Accessed 3 May 2023].
[3] Introduction. Hugging Face NLP course. https://huggingface.co/learn/nlp-course/chapter1/1. [Accessed 3 May 2023].
[4] Large language models: complete guide in 2023. https://research.aimultiple.com/large-language-models/. [Accessed 4 May 2023].
[5] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Advances in neural information processing systems, vol. 30. Curran Associates, Inc.; 2017. https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html. [Accessed 12 May 2023].
[6] Infographic: ChatGPT sprints to one million users. Statista Infographics. Published January 24, 2023. https://www.statista.com/chart/29174/time-to-one-million-users. [Accessed 30 March 2023].
[7] ChatGPT sets record for fastest-growing user base: analyst note. Reuters. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/. [Accessed 30 March 2023].
[8] GPT-4. OpenAI. https://openai.com/product/gpt-4. [Accessed 18 March 2023].
[9] What are tokens and how to count them? OpenAI Help Center. https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them. [Accessed 3 May 2023].
[10] What is a large language model (LLM)? TechTarget, WhatIs.com. https://www.techtarget.com/whatis/definition/large-language-model-LLM. [Accessed 3 May 2023].
[11] Wolfram S. What is ChatGPT doing … and why does it work? Stephen Wolfram Writings. Published February 14, 2023. https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/. [Accessed 30 March 2023].
[12] Introducing ChatGPT. OpenAI. https://openai.com/blog/chatgpt. [Accessed 11 April 2023].
[13] Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616(7956):259–65. https://doi.org/10.1038/s41586-023-05881-4.
[14] Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J Med Syst 2023;47(1):33. https://doi.org/10.1007/s10916-023-01925-4.
[15] Haman M, Školník M. Exploring the capabilities of ChatGPT in academic research recommendation. Resuscitation 2023. https://doi.org/10.1016/j.resuscitation.2023.109795.
[16] Boßelmann CM, Leu C, Lal D. Are AI language models such as ChatGPT ready to improve the care of individuals with epilepsy? Epilepsia 2023. https://doi.org/10.1111/epi.17570.
[17] Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA 2023. Published online February 3. https://doi.org/10.1001/jama.2023.1044.
[18] Ayoub NF, Lee YJ, Grimm D, Balakrishnan K. Comparison between ChatGPT and Google Search as sources of postoperative patient instructions. JAMA Otolaryngol Head Neck Surg 2023. Published online April 27. https://doi.org/10.1001/jamaoto.2023.0704.
[19] Johnson D, Goodman R, Patrinely J, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Research Square preprint 2023. https://doi.org/10.21203/rs.3.rs-2566942/v1.
[20] Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023. Published online April 28. https://doi.org/10.1001/jamainternmed.2023.1838.
[21] Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2(2):e0000198. https://doi.org/10.1371/journal.pdig.0000198.
[22] Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint; 2023.
[23] Diaz N. Epic to use Microsoft's GPT-4 in EHRs. Becker's Hospital Review. Published March 30, 2023. https://www.beckershospitalreview.com/ehrs/epic-to-use-microsofts-open-ai-in-ehrs.html. [Accessed 12 April 2023].
[24] OpenAI API. https://platform.openai.com. [Accessed 12 April 2023].
[25] Landi H. Microsoft's Nuance integrates OpenAI's GPT-4 into voice-enabled medical scribe software. Fierce Healthcare. Published March 21, 2023. https://www.fiercehealthcare.com/health-tech/microsofts-nuance-integrates-openais-gpt-4-medical-scribe-software. [Accessed 12 April 2023].
[26] ChatGPT confirms data breach, raising security concerns. Security Intelligence. Published May 2, 2023. https://securityintelligence.com/articles/chatgpt-confirms-data-breach/. [Accessed 4 May 2023].
[27] API data usage policies. OpenAI. https://openai.com/policies/api-data-usage-policies. [Accessed 4 May 2023].
[28] Bowman SR. Eight things to know about large language models. arXiv preprint; 2023.

R. Brandon Hunter*
Department of Pediatrics, Division of Critical Care, Texas Children's Hospital and Baylor College of Medicine, Houston, TX 77030, USA

Sanjiv D. Mehta
Department of Pediatrics, Division of Anesthesiology and Critical Care Medicine, Children's Hospital of Philadelphia and the University of Pennsylvania, Philadelphia, PA 19104, USA

Alfonso Limon
Oneirix Labs, Carlsbad, CA 92008, USA

Anthony C. Chang
Department of Pediatrics, Division of Cardiology, Children's Hospital of Orange County, Orange, CA 92868, USA

* Corresponding author.
E-mail address: [email protected] (R.B. Hunter).
