Using ChatGPT-4 to Create Structured Medical Notes From Audio Recordings of Physician-Patient Encounters: Comparative Study
Annessa Kernberg, Jeffrey A Gold, Vishnu Mohan
Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, United States
Corresponding author: Vishnu Mohan, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, 3181 SW Sam Jackson Park Road, Portland, OR 97239, United States. Phone: 1 5034944469. Email: [email protected]
Received 2023 Nov 9; Revisions requested 2024 Jan 31; Revised 2024 Feb 20; Accepted 2024 Mar 10.
Copyright © Annessa Kernberg, Jeffrey A Gold, Vishnu Mohan. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 22.04.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Abstract
Background
Medical documentation plays a crucial role in clinical practice, facilitating accurate patient management and communication among health care professionals. However, inaccuracies in medical notes can lead to miscommunication and diagnostic errors. Additionally, the demands of documentation contribute to physician burnout. Although intermediaries like medical scribes and speech recognition software have been used to ease this burden, they have limitations in terms of accuracy and addressing provider-specific metrics. The integration of ambient artificial intelligence (AI)–powered solutions offers a promising way to improve documentation while fitting seamlessly into existing workflows.
Objective
This study aims to assess the accuracy and quality of Subjective, Objective, Assessment, and Plan
(SOAP) notes generated by ChatGPT-4, an AI model, using established transcripts of History and
Physical Examination as the gold standard. We seek to identify potential errors and evaluate the
model’s performance across different categories.
Methods
We used transcripts of 14 simulated patient-provider encounters representing a range of ambulatory diagnoses. Each transcript was submitted to ChatGPT-4 three times with a standard prompt to generate a SOAP note, yielding 42 notes. Each note was scored against a gold standard checklist for errors of omission, incorrect facts, and additions, and note quality was assessed with the Physician Documentation Quality Instrument (PDQI-9).
Results
Although ChatGPT-4 consistently generated SOAP-style notes, there were, on average, 23.6 errors per clinical case, with errors of omission (86%) being the most common, followed by addition errors (10.5%) and inclusion of incorrect facts (3.2%). There was significant variance between replicates of the same case, with only 52.9% of data elements reported correctly across all 3 replicates. The accuracy of data elements varied across cases, with the highest accuracy observed in the "Objective" section. Consequently, the measure of note quality, assessed by the PDQI, demonstrated intra- and intercase variance. Finally, the accuracy of ChatGPT-4 was inversely correlated with both the transcript length (P=.05) and the number of scorable data elements (P=.05).
Conclusions
Our study reveals substantial variability in errors, accuracy, and note quality generated by ChatGPT-4. Errors were not limited to specific sections, and the inconsistency in error types across replicates complicated predictability. Transcript length and data complexity were inversely correlated with note accuracy, raising concerns about the model's effectiveness in handling complex medical cases. The quality and reliability of clinical notes produced by ChatGPT-4 do not meet the standards required for clinical use. Although AI holds promise in health care, caution should be exercised before widespread adoption. Further research is needed to address accuracy, variability, and potential errors. ChatGPT-4, while valuable in various applications, should not be considered a safe alternative to human-generated clinical documentation at this time.
Keywords: generative AI, generative artificial intelligence, ChatGPT, simulation, large language model, clinical documentation, quality, accuracy, reproducibility, publicly available, medical note, medical notes, generation, medical documentation, documentation, documentations, AI, artificial intelligence, transcript, transcripts, ChatGPT-4
Introduction
Medical documentation is an integral aspect of clinical practice, ensuring accurate and comprehensive patient management and serving as a communication tool among health care professionals. In recent years, it has become increasingly evident that inaccuracies in medical notes lead to miscommunication, diagnostic discrepancies, and patients' perceptions of subpar medical care [1]. Beyond the immediate implications of documentation errors, documentation demands have been identified as a significant contributor to physician burnout [2]. With health care professionals spending an increasing amount of their working hours on paperwork, there is less time and energy left for direct patient care.
To counter this, many institutions have adopted the use of intermediaries, such as medical scribes or speech recognition software, to shoulder the documentation load and allow clinicians to focus on patient interactions. However, both of these solutions have significant limitations, including concerns about documentation accuracy and a lack of impact on many provider-specific metrics surrounding after-hours charting and burnout [3,4]. In addition, the financial implications of employing medical scribes render them inaccessible to numerous health care practices [5]. Consequently, there is a continued search for innovative solutions that create effective and accurate documentation while integrating seamlessly into existing workflows.
With the rapid and exponential growth in computing capacity, artificial intelligence (AI) is being increasingly used in health care, holding the promise of revolutionizing medical documentation and thus potentially alleviating the burden on physicians [6]. AI-powered systems can analyze vast amounts of data quickly, identify patterns, and suggest diagnostic options. Although the allure of AI is undeniable, questions regarding its accuracy, reliability, and suitability in the clinical setting remain. The maturation of speech recognition technology has led to large-scale adoption by health care organizations, allowing for real-time transcription services. This, combined with software using large language models (LLMs), now enables the creation of structured medical notes in close temporal relation to the clinical encounter, thereby decreasing the clinician documentation burden [7,8]. Multiple software vendors are developing and deploying documentation assistance software powered by ambient AI, referred to as ambient digital scribes, and there is already significant interest on the part of clinicians and health care organizations in adopting them. However, little data exist on the safety and quality of the resulting documentation, and analysis is made more difficult by the proprietary AI engine used by each vendor.
One such AI system is OpenAI's ChatGPT-4, a state-of-the-art LLM known for its ability to engage in text-based communication with users (as a chatbot), which is used in some commercial ambient digital scribe solutions. Released in November 2022, ChatGPT is trained on a vast amount of text data from the internet and uses an LLM to answer users' prompts. Health care providers envision numerous applications for ChatGPT-4, such as answering patient questions, automating insurance prior authorizations, and creating differential diagnoses [9,10]. It is important to note that openly accessible AI platforms, such as ChatGPT-4, are not recommended for clinical use due to the many regulatory and privacy issues involved. Despite this, there is continued interest in whether ChatGPT-4 could serve as a freely available tool to assist as a documentation intermediary, bridging the gap between health care professionals and the tedious task of recordkeeping.
However, it is imperative that, prior to the widespread adoption of these tools, their safety and efficacy be evaluated in a structured and clinically relevant manner. Therefore, the goal of this study was to use transcripts from simulated patient-provider encounters to determine the accuracy, readability, and reproducibility of ChatGPT-4–generated Subjective, Objective, Assessment, and Plan (SOAP) notes.
Methods
Overview
As part of a project designed to evaluate the accuracy and efficacy of human scribe–generated notes, we created 14 simulated patient-provider encounters. All encounters used professional standardized patients and represented a wide range of ambulatory specialties. A standardized patient is an individual trained to simulate a medical scenario for health care education, assessment, and research. Briefly, for each case, a storyboard was created by subject matter experts and used for training the standardized patient to ensure standard content delivery according to best practices [11]. After an initial dry run, each scenario was conducted in a simulated ambulatory patient exam room equipped with audio-video capture. At the end of the scenario, audio-video files were exported for use. These cases represented a variety of diagnoses (Table 1).
Audio files for each case were professionally transcribed. For each encounter, 2 clinical experts on the study team used the transcript, informed by the initial storyboard, to create a list of key reportable elements; this list served as the scoring rubric for subsequent analysis. The encounter transcripts were then fed into ChatGPT-4 using a standard prompt ("generate a clinical note in SOAP format for the following"). The SOAP format is a widely used clinical documentation format that organizes the data elements of the clinical interview under headers representing its components; it is a standard model for medical documentation, providing a clear, concise framework for health care professionals to record and share patient information. Each transcript was run through the model three times to assess output fidelity associated with replicability, generating three documentation versions for each case, for a total of 42 ChatGPT-4–generated SOAP notes (the prompt and full output are present in Multimedia Appendix 1). A new discussion space was created for each case to prevent the various transcripts from conflating each other, and each prompt request for a case was conducted consecutively within the same discussion space.
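To make the workflow concrete, the sketch below reproduces the note-generation step in Python. The study used the ChatGPT-4 web interface, so the OpenAI Python client, the "gpt-4" model identifier, and the transcripts/notes file layout shown here are illustrative assumptions; the prompt text, the per-case conversation, and the three replicates per case follow the description above.

```python
# Sketch of the note-generation workflow; not the study's actual tooling.
# Assumes the OpenAI Python client and one transcript text file per case.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
MODEL = "gpt-4"    # assumed identifier standing in for ChatGPT-4
PROMPT = "generate a clinical note in SOAP format for the following"
REPLICATES = 3

def generate_soap_notes(transcript: str) -> list[str]:
    """Run one transcript three times within a single conversation ("discussion space")."""
    messages, notes = [], []
    for _ in range(REPLICATES):
        messages.append({"role": "user", "content": f"{PROMPT}\n\n{transcript}"})
        response = client.chat.completions.create(model=MODEL, messages=messages)
        note = response.choices[0].message.content
        messages.append({"role": "assistant", "content": note})  # keep the shared history
        notes.append(note)
    return notes

Path("notes").mkdir(exist_ok=True)
for path in sorted(Path("transcripts").glob("case_*.txt")):  # one file per encounter
    for i, note in enumerate(generate_soap_notes(path.read_text()), start=1):
        Path(f"notes/{path.stem}_replicate_{i}.txt").write_text(note)
```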
After acquiring the generated notes, various comparisons were made to assess the output's accuracy and quality. Within a single case, the 3 versions were analyzed based on the errors generated. Errors were defined as follows: (1) omissions, where expected documentation elements or data were missing; (2) incorrect facts, where a data element was referred to but was incorrect; and (3) additions, where information was added that was not in the transcript. This framework for defining quality in clinical documentation based on omissions, incorrect information, and additions is a structured approach to evaluating the accuracy and completeness of medical records. These characteristics, if present, help define the quality of documentation given their implications. For example, omissions can lead to gaps in patient care, misdiagnosis, or delays in treatment. Incorrect information can compromise patient safety and lead to negative health consequences. Additions, while not always harmful, can be inaccurate and reduce the efficiency of care delivery. This framework is particularly useful in assessing the performance of health care documentation processes, such as those involving medical scribes, and in quantifying appropriate information retrieval [4,12]. A correct data element was defined as one without any of the previously outlined errors.
To ensure and assess note quality, we outlined critical data elements for each clinical case. Members of the study team independently selected these crucial data elements and subsequently compiled them to guarantee comprehensiveness. They used these elements to generate a gold standard checklist and an associated gold standard History and Physical Examination note. Then, 2 raters graded the 3 ChatGPT-4 versions of each encounter based on whether they correctly included, missed, or wrongly presented the corresponding data element. We enumerated the number of errors and correct data elements for each version. Afterward, we compared the correct data elements across the 3 ChatGPT-4 versions for presence and consistency, as follows: (1) across all three versions, (2) across two versions, (3) only in a single version, or (4) not present at all. Finally, we compared the percentages of appropriate data elements across the versions to the transcript's length and the number of data elements.
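A minimal sketch of this scoring scheme follows. The rating labels, example data elements, and dictionaries standing in for the raters' manual grading are hypothetical; only the error taxonomy (omissions, incorrect facts, additions) and the 3/2/1/0-replicate congruence counts come from the procedure described above.

```python
from collections import Counter

# Hypothetical rater output for one case: each replicate maps every
# gold-standard data element to "correct", "omitted", or "incorrect";
# additions (facts not in the transcript) are listed separately.
def error_counts(element_labels: dict[str, str], additions: list[str]) -> Counter:
    """Tally omissions, incorrect facts, and additions for one generated note."""
    counts = Counter(label for label in element_labels.values() if label != "correct")
    counts["added"] = len(additions)
    return counts

def congruence(replicates: list[dict[str, str]]) -> Counter:
    """Count gold-standard elements reported correctly in 3, 2, 1, or 0 replicates."""
    hits = Counter()
    for element in replicates[0]:
        hits[sum(rep[element] == "correct" for rep in replicates)] += 1
    return hits

# Toy example: three gold-standard elements, three replicates of one case.
reps = [
    {"chief complaint": "correct", "penicillin allergy": "omitted", "2-week follow-up": "correct"},
    {"chief complaint": "correct", "penicillin allergy": "correct", "2-week follow-up": "incorrect"},
    {"chief complaint": "correct", "penicillin allergy": "omitted", "2-week follow-up": "correct"},
]
print(error_counts(reps[0], additions=["stated weight loss was intentional"]))
print(congruence(reps))  # elements correct in all 3, in 2, in 1, or in 0 replicates
```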
The data elements were divided into three documentation-related sections: (1) Subjective, further subdivided into the history of present illness and other patient-reported information, including medications, allergies, family history, social history, and past medical history; (2) Objective, which includes vital signs, physical exam, and any reported test results; and (3) Assessment and Plan, which includes the provider-reported differential, plan, and follow-up instructions. The percentages of correct data elements were then compared based on these categories.
Lastly, the Physician Documentation Quality Instrument (PDQI-9), a validated tool for assessing note quality, was used to evaluate the quality of the generated notes [13]. Using a set of predefined criteria, the PDQI facilitates the objective analysis of documentation practices. The PDQI-9 comprises 9 criteria that assess whether the document is (1) up to date, (2) accurate, (3) thorough, (4) useful, (5) organized, (6) comprehensible, (7) succinct, (8) synthesized, and (9) consistent. Each item is scored on a 5-point Likert scale, with the highest value representing the ideal characteristic. A maximum score of 45 indicates a document that exhibits the attributes to the highest degree, and a minimum score of 9 indicates that the attributes are not present at all. The PDQI score was calculated for the 3 versions of each generated note, averaged, and compared across the 14 cases by a member of the study team (AK).
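As a worked example of the PDQI-9 arithmetic described above, the sketch below sums nine 1-5 item scores into a 9-45 total and averages the totals across a case's three replicates; the individual item scores are invented for illustration.

```python
# Minimal PDQI-9 arithmetic: nine attributes scored 1-5, summed to a 9-45 total,
# then averaged across the three replicates of a case. Item scores are invented.
PDQI_ITEMS = ["up to date", "accurate", "thorough", "useful", "organized",
              "comprehensible", "succinct", "synthesized", "consistent"]

def pdqi_total(item_scores: dict[str, int]) -> int:
    assert set(item_scores) == set(PDQI_ITEMS) and all(1 <= s <= 5 for s in item_scores.values())
    return sum(item_scores.values())

replicate_scores = [
    {item: 4 for item in PDQI_ITEMS},                     # total 36
    {item: 3 for item in PDQI_ITEMS},                     # total 27
    {**{item: 3 for item in PDQI_ITEMS}, "accurate": 2},  # total 26
]
totals = [pdqi_total(scores) for scores in replicate_scores]
print(totals, sum(totals) / len(totals))  # [36, 27, 26] and a case mean of about 29.7
```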
Statistical Analysis
All statistical analyses were performed using GraphPad Prism (version 10; GraphPad Software Inc). Given the nonnormal distribution of the data, as determined by the Kolmogorov-Smirnov test, the Friedman test was used for overall and between-group comparisons. The Pearson r test was used for univariate correlations. A P value <.05 was considered statistically significant.
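The analyses were run in GraphPad Prism; for readers who prefer code, the sketch below shows equivalent SciPy calls (Kolmogorov-Smirnov normality check, Friedman test, and Pearson r) on placeholder arrays rather than the study data.

```python
# Equivalent tests in SciPy on placeholder data (not the study's values):
# Kolmogorov-Smirnov for normality, Friedman for the repeated measures across
# replicates, and Pearson r for univariate correlations, with alpha = .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rep1, rep2, rep3 = rng.random((3, 14))            # e.g., accuracy of the 3 replicates for 14 cases
transcript_length = rng.integers(1000, 6000, 14)  # placeholder transcript lengths
accuracy = rng.random(14)                         # placeholder per-case accuracy

print(stats.kstest(rep1, "norm", args=(rep1.mean(), rep1.std())))  # normality check
print(stats.friedmanchisquare(rep1, rep2, rep3))                   # between-replicate comparison
r, p = stats.pearsonr(transcript_length, accuracy)                 # univariate correlation
print(r, p, "significant" if p < .05 else "not significant")
```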
Ethical Considerations
The study was deemed exempt from institutional review board approval as it did not include human subjects and therefore did not pose any risks.
Results
We first looked at the overall structure of the notes. ChatGPT-4 was consistently able to generate a SOAP-style note. Overall, there was significant variance in note length between the 3 replicates: the replicate notes for some cases were very similar in length, while other cases showed nearly a 50% variance between replicates (Figure 1).
We classified errors into 3 types: errors of omission, those related to incorrect facts, and errors associated with information addition. Overall, the total number of errors ranged from 5.7 to 64.7 errors per case, with significant differences between the replicates (Figure 2A). When we subdivided errors into the 3 basic types, errors of omission were by far the most common, comprising, on average, 86.3% of all errors, followed by addition errors (10.5%) and incorrect facts (3.2%). Examples of these types of errors are illustrated in Table 2. There was significant variance both in the total number and in the distribution of errors between cases and between replicates of the same case (Figure 2B).
For accuracy, we assessed the overall congruence between replicates. The frequency of correct reporting across the 3 replicates was compared against the gold standard History and Physical Examination. Overall, the mean percentage of elements reported correctly across all 3 replicates for the 14 cases was 53% (range 22%-79%). Interestingly, nearly 30% of data elements were reported correctly in only 1 or 2 of the replicates, suggesting issues with both accuracy and congruency (Figure 3).
Breaking down ChatGPT-4's performance by individual category, there was significant variance in the accuracy of each section of the note. Specifically, the highest accuracy was observed in the Objective section of the note (median 86.9%, IQR 75.4%-96.9%), which was significantly higher compared with the history of present illness (median 63.8%, IQR 54.2%-76.8%; P=.02), Other (median 75.2%, IQR 68.5%-82.4%), and Assessment and Plan (median 66.9%, IQR 36.4%-83.5%) sections (Figure 4).
The combination of variance in note structure as well as in the number and type of errors resulted in similar variance in overall note quality as determined by the PDQI-9. Overall, the mean PDQI-9 score was 29.7 (range 23.7-39.7), with significant variance between replicates within a case (Figure 5).
Finally, we wished to determine whether characteristics of the parent transcript were associated with note quality. Overall, transcript length and the total number of scorable elements (as a measure of information density or complexity) both correlated inversely with the total percentage of elements reported correctly across the 3 replicates for each case (Figure 6). We observed similar findings for the PDQI-9 (details are not shown).
Discussion
Principal Results
Our study highlights the significant variations in the errors, accuracy, and quality of SOAP notes generated by ChatGPT-4. Errors were not limited to specific sections of the note and included errors of omission as well as commission. Although the total number of errors tracked the number of data elements, another important finding is that the error rate was not consistent across replicates of the same case. This means that the model is not making the same errors repeatedly, making it difficult for health care providers to predict where errors may occur. This variability introduces a level of unpredictability, which can impact clinical oversight.
In the context of medical research, our investigation has shed light on the critical issue of documentation accuracy, which has been a recurring concern in prior studies. Our findings align with the existing body of research on digital scribes, revealing noteworthy variations in accuracy, particularly in the context of nonobjective data [4,14]. In the realm of ChatGPT-4, the study conducted by Johnson et al [15] delved into its performance in giving precise and comprehensive medical information. This inquiry enlisted the expertise of 33 physicians, spanning 17 different specialties, who formulated questions that were subsequently posed to ChatGPT. Approximately 57.8% of the generated responses were assessed as accurate or nearly correct. This outcome underscores the imperative for exercising caution when relying solely on AI-generated medical information and the need for continuous evaluation, as others have noted [16]. However, another study by Walker et al [17], aimed at evaluating the reliability of medical information provided by ChatGPT-4, found that multiple iterations of their queries executed through the model yielded a remarkable 100% internal consistency among the generated outputs. Although promising, it should be noted that the queries used in their experiment consisted of direct single-sentence questions pertaining to specific hepatobiliary diagnoses. This mode of input differs significantly from the transcription of patient encounters. Our research, in contrast, stands out by probing the reproducibility of note generation, a relatively less explored topic in the existing literature.
The PDQI-9 scores also highlight the overall variance in quality. In previous research, a PDQI-9 score of 26.2 was rated "terrible or bad," versus a PDQI-9 score of 36.6, which was rated "good or excellent" [13]. In our study, the mean PDQI-9 score of 29.7 is closer to the "terrible or bad" range. These observations suggest that although ChatGPT-4 can consistently generate a SOAP-style note, it introduces errors and struggles with maintaining uniformity and accuracy. These issues could pose potential challenges if implemented in a clinical setting.
An essential aspect of our research was to identify the factors contributing to inaccuracies in AI-generated notes. Notably, we found an inverse correlation between note accuracy and transcript length as well as the amount of reportable data. This observation has profound implications for large language models like ChatGPT-4, indicating their challenges with longer and denser information. This raises questions about their effectiveness in handling complex medical cases.
These findings have significant clinical implications. The high variability in PDQI-9 scores, coupled with a high error rate, indicates low-quality notes. Recently, there have been concerns regarding ChatGPT-4's capacity to generate what can be classified as "hallucinations": synthesized data that may be misinterpreted as factual information. These data are often incomplete and sometimes misleading [18]. This has implications for the quality of patient care, potentially leading to diagnostic errors and eroding trust in AI, both among health care providers and patients. Acknowledging the increasing documentation burden contributing to physician burnout, generative AI technology for clinical note documentation may save time [2,19]. However, if our data are representative of similar accuracy rates with other AI-powered systems, any time savings could be negated by the need for corrections. This mirrors previous studies with human scribes, where widespread adoption had little impact on after-hours charting or chart completion time [20-22].
Limitations
Our research is not without limitations. Primarily, the generated SOAP notes underwent processing through an openly accessible AI model, in contrast to the proprietary closed models commonly used in the generative AI domain of health care. It is pertinent to note that proprietary technologies, such as DAX Copilot (a collaborative venture of Microsoft and Nuance), have restricted accessibility, being available only to entities holding contractual agreements with the parent company. Furthermore, these models evolve iteratively. Consequently, the errors as well as the correct elements in our current data set might not manifest in subsequent versions. However, it is important to note that the methodology reported here establishes a means by which these systems can be evaluated systematically. It should also be acknowledged that this study used only transcripts, eliminating the confounder of any potential errors introduced by the speech recognition step [23,24]. Integrating this aspect will be critical for a more complete evaluation of fully integrated generative AI–powered documentation assistants. Another limitation is the inability to draw conclusions regarding the correlation between types of cases and associated errors; a substantially larger volume of encounters would be required to delineate this relationship. Additionally, despite its standardized criteria, the PDQI can still be influenced by the subjective judgment of the reviewer and can be time-consuming to apply, particularly for longer documents.
Conversely, the instrument does cover a broad range of quality dimensions, facilitating a more holistic evaluation. Furthermore, it can be used as a diagnostic tool to identify strengths and weaknesses, guiding targeted quality improvement initiatives. Finally, in large language models such as ChatGPT-4, the temperature of the model is a parameter that controls the randomness or predictability of the model's output. It is a component that tunes the model to generate responses that are either more varied and creative or more deterministic and conservative. With this in mind, ChatGPT-4's temperature allows for variability, but this setting is not accessible to the end user [25,26]. Further, even setting the temperature to zero does not appear to ensure uniformity of response [27]. Along these lines, the absence of real-time feedback within the application also limits the model's ability to adjust its responses based on user input and therefore hinders the model's opportunity to learn from real-world interactions and refine its output.
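For illustration, this is how temperature could be pinned when calling the model through the OpenAI API rather than the web interface used in this study; the client and model identifier are assumptions, the transcript is elided, and, as noted above, even temperature 0 does not guarantee identical outputs [27].

```python
# Pinning temperature through the API (not possible in the ChatGPT web UI used here).
# Even with temperature=0, repeated calls may still differ slightly [27].
from openai import OpenAI

client = OpenAI()
outputs = {
    client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        temperature=0,  # most deterministic setting available
        messages=[{"role": "user",
                   "content": "generate a clinical note in SOAP format for the following ..."}],
    ).choices[0].message.content
    for _ in range(3)
}
print(len(outputs))  # a value >1 means the three outputs were not identical despite temperature 0
```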
Conclusions
Abbreviations
AI: artificial intelligence
HPI: history of present illness
LLM: large language model
PDQI-9: Physician Documentation Quality Instrument
SOAP: Subjective, Objective, Assessment, and Plan
Multimedia Appendix 1
ChatGPT-4 responses.
Notes
Data Availability
The data sets generated and analyzed during this study are available from the corresponding author on reasonable request.
Footnotes
References
1. Bell SK, Delbanco T, Elmore JG, Fitzgerald PS, Fossa A, Harcourt K, Leveille SG, Payne TH, Stametz RA, Walker J, DesRoches CM. Frequency and types of patient-reported errors in electronic health record ambulatory care notes. JAMA Netw Open. 2020;3(6):e205867. doi: 10.1001/jamanetworkopen.2020.5867. [PMCID: PMC7284300] [PubMed: 32515797]
2. Gaffney A, Woolhandler S, Cai C, Bor D, Himmelstein J, McCormick D, Himmelstein DU. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern Med. 2022;182(5):564-566. doi: 10.1001/jamainternmed.2022.0372. [PMCID: PMC8961402] [PubMed: 35344006]
3. Florig ST, Corby S, Rosson NT, Devara T, Weiskopf NG, Gold JA, Mohan V. Chart completion time of attending physicians while using medical scribes. AMIA Annu Symp Proc. 2021;2021:457-465. [PMCID: PMC8861674] [PubMed: 35308986]
4. Pranaat R, Mohan V, O'Reilly M, Hirsh M, McGrath K, Scholl G, Woodcock D, Gold JA. Use of simulation based on an electronic health records environment to evaluate the structure and accuracy of notes generated by medical scribes: proof-of-concept study. JMIR Med Inform. 2017;5(3):e30. doi: 10.2196/medinform.7883. [PMCID: PMC5628287] [PubMed: 28931497]
5. Corby S, Whittaker K, Ash JS, Mohan V, Becton J, Solberg N, Bergstrom R, Orwoll B, Hoekstra C, Gold JA. The future of medical scribes documenting in the electronic health record: results of an expert consensus conference. BMC Med Inform Decis Mak. 2021;21(1):204. doi: 10.1186/s12911-021-01560-4. [PMCID: PMC8240616] [PubMed: 34187457]
6. Ahuja AS. The impact of artificial intelligence in medicine on the future role of the physician. PeerJ. 2019;7:e7702. doi: 10.7717/peerj.7702. [PMCID: PMC6779111] [PubMed: 31592346]
7. Hickenlooper J, Boyter M, Sycamore T. Clinical documentation strategies 2023. KLAS Research. [2024-04-11]. https://klasresearch.com/report/clinical-documentation-strategies-2023-examining-which-options-best-fit-your-needs/2763
8. Goss FR, Blackley SV, Ortega CA, Kowalski LT, Landman AB, Lin C, Meteer M, Bakes S, Gradwohl SC, Bates DW, Zhou L. A clinician survey of using speech recognition for clinical documentation in the electronic health record. Int J Med Inform. 2019;130:103938. doi: 10.1016/j.ijmedinf.2019.07.017. [PubMed: 31442847]
9. Hirosawa T, Harada Y, Yokose M, Sakamoto T, Kawamura R, Shimizu T. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: a pilot study. Int J Environ Res Public Health. 2023;20(4):3378. doi: 10.3390/ijerph20043378. [PMCID: PMC9967747] [PubMed: 36834073]
10. Dahmen J, Kayaalp ME, Ollivier M, Pareek A, Hirschmann MT, Karlsson J, Winkler PW. Artificial intelligence bot ChatGPT in medical research: the potential game changer as a double-edged sword. Knee Surg Sports Traumatol Arthrosc. 2023;31(4):1187-1189. doi: 10.1007/s00167-023-07355-6. [PubMed: 36809511]
11. Lewis K, Bohnert CA, Gammon WL, Hölzer H, Lyman L, Smith C, Thompson TM, Wallace A, Gliva-McConvey G. The Association of Standardized Patient Educators (ASPE) Standards of Best Practice (SOBP). Adv Simul (Lond). 2017;2:10. doi: 10.1186/s41077-017-0043-4. [PMCID: PMC5806371] [PubMed: 29450011]
12. Artis KA, Bordley J, Mohan V, Gold JA. Data omission by physician trainees on ICU rounds. Crit Care Med. 2019;47(3):403-409. doi: 10.1097/ccm.0000000000003557. [PMCID: PMC6407821] [PubMed: 30585789]
13. Stetson PD, Bakken S, Wrenn JO, Siegler EL. Assessing electronic note quality using the Physician Documentation Quality Instrument (PDQI-9). Appl Clin Inform. 2012;3(2):164-174. doi: 10.4338/aci-2011-11-ra-0070. [PMCID: PMC3347480] [PubMed: 22577483]
14. Rule A, Florig ST, Bedrick S, Mohan V, Gold JA, Hribar MR. Comparing scribed and non-scribed outpatient progress notes. AMIA Annu Symp Proc. 2021;2021:1059-1068. [PMCID: PMC8861667] [PubMed: 35309010]
15. Johnson D, Goodman R, Patrinely J, Stone C, Zimmerman E, Donald R, Chang S, Berkowitz S, Finn A, Jahangir E, Scoville E, Reese T, Friedman D, Bastarache J, van der Heijden Y, Wright J, Carter N, Alexander M, Choe J, Chastain C, Zic J, Horst S, Turker I, Agarwal R, Osmundson E, Idrees K, Kieman C, Padmanabhan C, Bailey C, Schlegel C, Chambless L, Gibson M, Osterman T, Wheless L. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res Sq. 2023 Feb 28;43(6):622-626. doi: 10.21203/rs.3.rs-2566942/v1. [PubMed: 36909565]
16. Preiksaitis C, Sinsky CA, Rose C. ChatGPT is not the solution to physicians' documentation burden. Nat Med. 2023;29(6):1296-1297. doi: 10.1038/s41591-023-02341-4. [PubMed: 37169865]
17. Walker HL, Ghani S, Kuemmerli C, Nebiker CA, Müller BP, Raptis DA, Staubli SM. Reliability of medical information provided by ChatGPT: assessment against clinical guidelines and patient information quality instrument. J Med Internet Res. 2023;25:e47479. doi: 10.2196/47479. [PMCID: PMC10365578] [PubMed: 37389908]
18. Van Bulck L, Moons P. What if your patient switches from Dr. Google to Dr ChatGPT? A vignette-based survey of the trustworthiness, value and danger of ChatGPT-generated responses to health questions. Eur J Cardiovasc Nurs. 2023 Apr
19. Kroth PJ, Morioka-Douglas N, Veres S, Babbott S, Poplau S, Qeadan F, Parshall C, Corrigan K, Linzer M. Association of electronic health record design and use factors with clinician stress and burnout. JAMA Netw Open. 2019;2(8):e199609. doi: 10.1001/jamanetworkopen.2019.9609. [PMCID: PMC6704736] [PubMed: 31418810]
20. Florig ST, Corby S, Devara T, Weiskopf NG, Mohan V, Gold JA. Medical record closure practices of physicians before and after the use of medical scribes. JAMA. 2022;328(13):1350. doi: 10.1001/jama.2022.13558. [PMCID: PMC9437823] [PubMed: 36048452]
21. Gidwani R, Nguyen C, Kofoed A, Carragee C, Rydel T, Nelligan I, Sattler A, Mahoney M, Lin S. Impact of scribes on physician satisfaction, patient satisfaction, and charting efficiency: a randomized controlled trial. Ann Fam Med. 2017;15(5):427-433. doi: 10.1370/afm.2122. [PMCID: PMC5593725] [PubMed: 28893812]
22. Jhaveri P, Abdulahad D, Fogel B, Chuang C, Lehman E, Chawla L, Foley K, Phillips T, Levi B. Impact of scribe intervention on documentation in an outpatient pediatric primary care practice. Acad Pediatr. 2022;22(2):289-295. doi: 10.1016/j.acap.2021.05.004. [PubMed: 34020102]
23. Hodgson T, Magrabi F, Coiera E. Efficiency and safety of speech recognition for documentation in the electronic health record. J Am Med Inform Assoc. 2017;24(6):1127-1133. doi: 10.1093/jamia/ocx073. [PMCID: PMC7651984] [PubMed: 29016971]
24. Mohr DN, Turner DW, Pond GR, Kamath JS, De Vos CB, Carpenter PC. Speech recognition as a transcription aid: a randomized comparison with standard transcription. J Am Med Inform Assoc. 2003;10(1):85-93. doi: 10.1197/jamia.m1130. [PMCID: PMC150361] [PubMed: 12509359]
25. Peng K, Ding L, Zhong Q, Shen L, Liu X, Zhang M, Ouyang Y, Tao D. Towards making the most of ChatGPT for machine translation. SSRN Journal. 2023:1-9. doi: 10.2139/ssrn.4390455.
26. Guo Q, Cao J, Xie X. Exploring the potential of ChatGPT in automated code refinement: an empirical study. ICSE '24: Proceedings of the 46th IEEE/ACM International Conference on Software Engineering; April 14-20, 2024; Lisbon, Portugal. doi: 10.1145/3597503.3623306.
27. Ouyang S, Zhang J, Harman M. LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation. arXiv. Preprint posted online on August 5, 2023. doi: 10.48550/arXiv.2308.02828.
Table 1
Case number and associated diagnosis (14 simulated patient-provider encounter transcripts representing a variety of diagnoses).
Case 1: Gastroenteritis
Case 2: Incarcerated inguinal hernia
Case 3: Diabetic ketoacidosis
Case 4: Ovarian cyst
Case 5: Pneumonia
Case 6: Menstrual migraine
Case 7: Breast mass
Case 8: Heart failure
Case 9: Polymyalgia rheumatica
Case 11: Decreased fetal movement
Case 12: Diverticulitis
Case 13: Scleroderma
Case 14: Colon cancer
Figure 1
ChatGPT-4–generated note length per case (a comparison of the 14 cases versus the ChatGPT-4–generated note lengths).
Figure 2
Accuracy of ChatGPT-4–generated notes (variations in errors). (A) The 3 ChatGPT-4–generated note replicates were compared based on the total number of error events per case and (B) based on omissions, incorrect facts, and addition errors per case.
Table 2
Error examples (examples of omission, incorrect fact, and addition errors seen in the generated notes).
Omissions
Case 2: incarcerated inguinal hernia. Failed to mention the lack of appetite or blood in vomit on review of systems.
Incorrect facts
Case 11: decreased fetal movement. Added that the fetus was measuring 3 weeks behind expected gestational age when the transcript stated only that it was measuring behind expected, with no quantification.
Additions
Case 7: breast mass. Stated weight loss was intentional when this was not mentioned.
Case 8: heart failure. Added that the patient was noncompliant with medication when compliance was not mentioned.
Case 9: polymyalgia rheumatica. Added additional labs and consults that were not mentioned.
Figure 3
The reproducibility of note accuracy of the ChatGPT-4–generated notes. The percentages of data elements that were reported correctly across 3, 2, 1, or 0 ChatGPT-4–generated replicates were compared across cases.
Figure 4
The percentages of correct elements averaged per case based on note category. Each transcript was run through ChatGPT-4 three times and the percentages of correct data elements were averaged across the replicates. The data elements were divided into History of Present Illness (HPI), Other (eg, medications, allergies, family history, social history, and past medical history), Objective (eg, vital signs, physical exam, and test results), and Assessment and Plan (A/P). The average percentage of correct data in each case was compared based on these documentation categories. The overall difference between groups was significant (P=.02). * indicates a statistically significant difference between the HPI and Objective sections (P<.05 was considered significant).
Figure 5
Quality of ChatGPT-4 notes per case. The Physician Documentation Quality Instrument-9 (PDQI-9) scoring system was
used to evaluate the quality of the generated notes and then compared across the 14 cases.
Figure 6
The accuracy of the ChatGPT-4–generated notes. The percentage of correct data elements present in all 3 note replicates
was compared against (A) the original transcript length and (B) the number of data elements per case.