Review of Data-Driven Generative AI Models For Knowledge Extraction From Scientific Literature in Healthcare
Abstract
This review examines the development of abstractive natural language processing (NLP)-based
text summarization approaches and compares them to existing techniques for extractive
summarization. A brief history of text summarization from the 1950s to the introduction of
pretrained language models such as Bidirectional Encoder Representations from Transformers
(BERT) and Generative Pre-training Transformers (GPT) is presented. In total, 60 studies were
identified in PubMed and Web of Science; 29 were excluded during screening and 24 were read and
evaluated for eligibility, resulting in seven studies being retained for further analysis. This chapter also
includes a section of worked examples, including a comparison between GPT-3 and state-of-the-art
GPT-4 solutions in scientific text summarization. Natural language processing has not
yet reached its full potential in the generation of brief textual summaries. As there are
acknowledged concerns that must be addressed, we can expect a gradual introduction of such
models into practice.
Keywords: Abstractive summarization; AI; ChatGPT; Deep neural networks; Few-shot learning;
GPT-4; Healthcare; Knowledge extraction; Large language models; NLP; Zero-shot learning.
1. Introduction
Recent progress in natural language processing (NLP) research using large pretrained
language models built on deep neural networks (DNNs), with parameter counts on the order
of 100 billion, commonly called large language models (LLMs), has pushed
the limits of language understanding and text generation. Standard practice for such
general-purpose models is to adapt them to specific downstream tasks, often via
text prompts. This review presents a current overview of such models in healthcare with
the downstream task of knowledge extraction from scientific literature, more specifically,
abstractive summarization. Extracting very short and concise summaries of the paper
content can save healthcare experts’ time when browsing through the results of search
queries, especially in cases when advice to the patient should be supported by the latest
scientific literature immediately at the point of care. The review consists of a brief history
of the development of the NLP summarization approaches and proceeds with familiarizing
the reader with the methodology of the current state-of-the-art (SOTA) techniques. SOTA
models are also examined, with a focus on models that can generate abstractive summaries,
such as OpenAI’s GPT-3 Curie and Davinci (Korngiebel & Mooney, 2021; OpenAI, 2022)
with Too Long; Didn’t Read (TLDR) summarization. These models are compared to
existing models fine-tuned for extractive summaries in the healthcare domain, such as the
Semantic Scholar platform, which employs controlled abstraction for TLDRs with title
scaffolding (CATTS) (Cachola et al., 2020). Opportunities and problems that arise with
abstractive summaries are addressed. One such problem is the evaluation of
abstractive summaries: automatic methods such as Recall-Oriented Understudy for
Gisting Evaluation (ROUGE), which are used extensively for extractive summaries
(ROUGE appeared in two-thirds of the summarization evaluations in papers from the
North American Chapter of the Association for Computational Linguistics (NAACL) and the
Association for Computational Linguistics (ACL) in 2021 (Kasai et al., 2022)), might not
be a viable option when using LLMs. Current trends and best practice guidelines are still in
favor of human evaluations (van der Lee et al., 2021). However, proposed reference-free
metrics for natural language generation (NLG) evaluation, such as
G-Eval (Liu et al., 2023) and GPTScore (Fu et al., 2023), are applicable to new tasks
without human references and might eventually replace human evaluation.
The review covers the following topics: Introduction to a brief history of text
summarization using NLP, SOTA NLP summarization approaches, applications in the
healthcare domain, showcased evaluations of abstractive summaries generated by SOTA
models, and the limitations and current challenges of existing methods. We see additional
value in such a review: it provides important feedback to developers of such models for
future improvements.
The history of text summarization begins with the development of NLP. The origin of
NLP is usually attributed to machine translation, which emerged around the time of the
Second World War to translate between English and Russian (Sreelekha
et al., 2016).
During the 1950s and 1960s, Luhn and Edmundson studied ways of
automatically extracting summaries from documents using several cues: the topic tends to
appear in initial sentences (e.g., in the first paragraph, immediately after sections such as
“Introduction,” “Purpose,” or “Conclusions”); topic sentences can be located by the presence of
pragmatic words (“significant,” “impossible,” “hardly”); and, location-wise, key
information tends to appear in the first and last sentences of paragraphs (Edmundson, 1969;
Luhn, 1958).
Interest in text summarization grew in the 1980s, when researchers recognized NLP as a
promising research topic. At that time, a rule-based algorithm known as an
importance evaluator was predominantly used in text summarization. The importance
evaluator used two knowledge bases: an importance rule base that contained knowledge
expressed as IF-THEN rules, and an “encyclopedia” containing domain-specific world
knowledge structured as a network of frames (Fum et al., 1986).
Later, the development of text summarization progressed with the introduction of the
diversity-based approach to extractive summarization. This approach calculated the
diversity of sentences and removed redundant sentences from the final abstract. Diversity
was calculated using the K-means clustering algorithm together with the Minimum
Description Length principle (Nomoto & Matsumoto, 2001).
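The diversity-based idea can be illustrated with a brief sketch: sentences are embedded as TF-IDF vectors, clustered with K-means, and one representative sentence is kept per cluster. This is a simplified illustration rather than Nomoto and Matsumoto's exact algorithm; in particular, the fixed number of clusters below stands in for their MDL-based model selection.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

def diversity_summary(sentences, n_clusters=3):
    """Keep one representative sentence per K-means cluster, so that
    near-duplicate sentences collapse into a single pick."""
    vectors = TfidfVectorizer().fit_transform(sentences)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    # For each cluster, keep the sentence closest to its centroid.
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, vectors)
    return [sentences[i] for i in sorted(set(closest))]
```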
Another way to extract sentences was to apply graph-based ranking algorithms. These
algorithms determine the importance of a node within a graph according to global
information computed recursively from the entire graph. TextRank sentence extraction
algorithm uses the same principle, except that the graph is built from natural language text
and includes multiple or partial links, represented as the graph’s edges, that are extracted
from the text (Mihalcea, 2004).
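A minimal sketch in the spirit of TextRank, assuming TF-IDF cosine similarity for the edge weights (the original algorithm scores similarity by content-word overlap) and networkx's PageRank for the recursive global ranking:

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, top_k=2):
    # Build a weighted graph whose nodes are sentences and whose
    # edge weights are pairwise cosine similarities.
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    graph = nx.from_numpy_array(sim)
    # PageRank derives each node's importance from the entire graph.
    scores = nx.pagerank(graph, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(ranked)]  # original order
```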
In 2009, Suanmali et al. introduced text summarization based on the general statistic
method (GSM) and fuzzy logic method to decide on the importance of the text. This
method consisted of four components: fuzzifier, inference engine, defuzzifier, and fuzzy
knowledge base (Suanmali et al., 2009).
A few years later, an encoder-decoder model was developed: the sequence-to-sequence
(Seq2Seq) learning model. The architecture of Seq2Seq was composed of an encoder, a
recurrent neural network (RNN) with LSTM/GRU, and a decoder (RNN). The role of an
encoder was to encode the input sequence into a single fixed-size vector, while the
decoder yielded an output sequence based on that fixed-size vector. Performance issues
with Seq2Seq arose when long, complex sentences had to be encoded into a
single fixed-size vector (Sutskever et al., 2014). This was later resolved with the attention
mechanism, which identifies relevant input tokens (elements of the text) by
computing a context vector over all tokens in the input sentence for each token in the output
sentence (Bahdanau et al., 2015). Note that tokens are the basic units in many LLMs, usually
defined as groups of characters and used to measure the length of text; for example, in
GPT models, 1000 tokens represent approximately 750 words, and a paragraph usually
consists of around 35 tokens (OpenAI, 2022).
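The word-to-token ratio can be checked directly with OpenAI's open-source tiktoken tokenizer; a small sketch (the exact ratio varies with the text and the model's encoding):

```python
import tiktoken  # pip install tiktoken

text = "Natural language processing has not yet reached its full potential."
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = enc.encode(text)
print(len(text.split()), "words ->", len(tokens), "tokens")
# For typical English prose, ~1000 tokens correspond to roughly 750 words.
```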
Based on the improvements achieved with the attention mechanism, Vaswani
et al. proposed the transformer architecture, which uses self-attention to selectively focus
on parts of the input and multihead attention to attend to multiple parts
simultaneously. Furthermore, it uses no recurrence or convolution, relying instead on
position-wise feed-forward networks (Vaswani et al., 2017).
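The core self-attention operation can be written in a few lines. A minimal NumPy sketch of single-head scaled dot-product attention as defined by Vaswani et al., without masking or learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # weighted sum of value vectors

# Toy example: 3 tokens with 4-dimensional representations attend to each other.
x = np.random.default_rng(0).normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x).shape)  # (3, 4)
```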
The evolution of pretrained language models began with the introduction of Bidirectional
Encoder Representations from Transformers (BERT). BERT is based on the transformer
architecture, uses Masked Language Modeling (MLM), and is pretrained on a vast
amount of text data, such as articles, books, and other literature, reportedly over 3.3
billion words (Devlin et al., 2019). Like the other LLMs introduced later, BERT uses a
transfer learning technique, specifically sequential transfer learning, which consists of
two steps: a pretraining stage, where the neural network is trained on general data, and a
fine-tuning stage, where the model is trained on a domain-specific task (Devlin et al., 2019;
Zhang et al., 2019).
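MLM pretraining can be probed directly with the Hugging Face transformers library; a brief sketch, assuming the publicly available bert-base-uncased checkpoint:

```python
from transformers import pipeline  # pip install transformers

# BERT was pretrained to predict masked tokens from bidirectional context.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The patient was prescribed [MASK] for hypertension."):
    print(f"{pred['token_str']:>12}  p={pred['score']:.3f}")
```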
In the same year, OpenAI developed the first in a series of Generative
Pre-training Transformers (GPT), known as GPT version 1 (GPT-1) (Radford et al., 2018).
GPT-2 (with 1.5 billion parameters) followed a year later (Jeffrey & Rewon, 2019), and
GPT-3 in 2020 (Brown et al., 2020). GPT-3 has more than 100 times as many parameters as
its predecessor, precisely 175 billion (Brown et al., 2020).
During the development of the LLM series by OpenAI, in 2020, Google’s teams
introduced the PEGASUS model, which utilizes MLM together with gap sentence generation
(GSG). Their model removes/masks important sentences from the input text and generates
new sentences from the remaining text. PEGASUS achieves strong results with as
few as 1000 fine-tuning examples (Zhang, Zhao, et al., 2020).
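PEGASUS checkpoints fine-tuned for extreme summarization are available through the Hugging Face transformers library; a brief sketch using the google/pegasus-xsum checkpoint (the same model compared later in this chapter), with an invented input text for illustration:

```python
from transformers import pipeline  # pip install transformers

# google/pegasus-xsum is a PEGASUS checkpoint fine-tuned for one-sentence summaries.
summarizer = pipeline("summarization", model="google/pegasus-xsum")
abstract = (
    "Physical activity is a recommended therapy for nonalcoholic fatty liver "
    "disease, yet most patients remain inactive. We surveyed patients to "
    "identify barriers to exercise and found that fatigue and lack of guidance "
    "were reported most often."
)
summary = summarizer(abstract, max_length=40, min_length=5)[0]["summary_text"]
print(summary)
```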
In late 2022, ChatGPT, another variant of GPT, was released. ChatGPT is a
chatbot built on top of GPT-3.5 with the ability to provide short abstracts on demand
(Aydın & Karaarslan, 2022; Jiao et al., 2023). It should be noted that GPT-4 was released
in the first quarter of 2023; although few specific details of the model are known
(Models - OpenAI API, 2023), early indications imply improvements in multiple
downstream tasks compared to previous iterations.
It should be noted that GPT-4 Turbo was released in the last quarter of 2023 (OpenAI,
2023), bringing multiple API improvements such as a 128K context window, API-accessible
AI agents with optional retrieval-augmented generation (RAG) integration, called assistants,
as well as new modalities such as vision and speech (New Models and Developer Products
Announced at DevDay, 2023).
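TLDR-style summarization with the GPT-3 era completion models discussed above amounted to appending a “Tl;dr” cue to the abstract and letting the model complete it. A sketch using the legacy, pre-1.0 openai Python package and the text-davinci-003 completion model (both since deprecated; shown here to match the models reviewed in this chapter):

```python
import openai  # legacy pre-1.0 interface; requires OPENAI_API_KEY to be set

def tldr(abstract: str) -> str:
    # Appending "Tl;dr" prompts the completion model to summarize the input.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=abstract + "\n\nTl;dr:",
        max_tokens=60,
        temperature=0.0,  # near-deterministic output for comparison studies
    )
    return response["choices"][0]["text"].strip()
```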
2. Methods
In this section, we introduce the basic terminology relevant to short summarization models
and study selection for this review.
Extractive summarization is a technique that extracts the most important sentences from
the text and arranges them into a logical summary (Gupta & Gupta, 2019), without
generating new sentences or altering the original text. Abstractive summarization is a
technique that summarizes text by understanding its content and producing new or
rephrased sentences. It is considered one of the more challenging
tasks in NLP since it combines the understanding of long passages, information
compression, and language generation (Zhang, Zhao, et al., 2020). The aim of
abstractive summarization is to produce human-like summaries that successfully capture
the meaning of the source text.
Challenges will be addressed in the subsection “2.3.4 Limitations and challenges of
existing SOTA models.”
Zero-shot learning asks a model to perform a task without any task-specific examples
(Kojima et al., 2022; Romera-Paredes & Torr, 2015). On the other hand, few-shot learning
(Bataa & Wu, 2020; Chintagunta et al., 2021) supplies the learner with a few previously
unseen samples of data from which to learn a new task (Kojima et al., 2022). In scenarios
where few-shot learning is utilized, a couple of examples are often sufficient. Speaking of
meta-learners, Brown et al. demonstrated that zero-, one-, and few-shot performance
increases with model capacity, indicating that larger models are more capable (Brown
et al., 2020).
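Few-shot prompting requires no gradient updates: the demonstrations are simply concatenated ahead of the new input. A model-agnostic sketch, with invented demonstration pairs for illustration:

```python
def build_few_shot_prompt(examples, query):
    """Concatenate (text, summary) demonstrations ahead of the new input,
    so the model infers the task in-context (Brown et al., 2020)."""
    parts = []
    for text, summary in examples:
        parts.append(f"Text: {text}\nSummary: {summary}\n")
    parts.append(f"Text: {query}\nSummary:")
    return "\n".join(parts)

demos = [
    ("Aspirin reduced event rates in the trial arm ...",
     "Aspirin lowered cardiovascular events."),
    ("Patients using the app logged more activity ...",
     "App use increased physical activity."),
]
prompt = build_few_shot_prompt(demos, "Telehealth visits shortened waiting times ...")
print(prompt)
```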
In an earlier study, researchers showed that the few-shot GPT-3 model
performs poorly in the biomedical domain, specifically, in experiments that deal with
textual inference, estimating semantic similarity, question answering, and others (Moradi
et al., 2021).
Our general research question was to investigate to what extent LLMs can be used to
extract viable knowledge in healthcare settings. The following search engines were used
for conducting a scoping review: PubMed and Web of Science. We limited the review to
studies conducted in the past 5 years (2017–22). Studies were retrieved in
December 2022 using the following search term:
(((text OR abstract OR scientific literature OR scientific document) AND (summarization
OR summarisation OR TLDR)) AND (large language models OR LLM OR NLP OR
NLP) AND (healthcare OR health care OR primary care OR patient care OR nursing OR
medical care)).
The screening was carried out according to the guidelines of Arksey and O’Malley (2005).
In the first step, duplicate studies were removed. The remaining studies were assessed by
checking the title and abstract for the presence of words such as “summari(s/z)ation,”
“TLDR,” terms that relate to healthcare, and phrases that resemble and belong to the NLP
domain. Systematic and other reviews were included and reviewed regardless of the
satisfaction of the above criteria. Included studies were then skimmed and reviewed for
relevance. We excluded research in languages other than English. Studies that met
those given criteria underwent full-text reading individually by two authors. In case of
disagreement, the authors resolved their differences through discussion.
The study selection process is also displayed in the Results section using a PRISMA flow
diagram (Page et al., 2022). Additionally, in the Results section, we demonstrate some
examples of evaluations of abstractive summaries generated by SOTA models.
3. Results
3.1 Search results
Following the PRISMA diagram (Fig. 7.1), a total of 60 studies were identified in PubMed
(45) and Web of Science (15). After the removal of duplicated records, 54 studies
underwent a screening process (Fig. 7.1).

Figure 7.1
PRISMA diagram. [Identification: records identified from PubMed (n = 45) and Web of Science (n = 15); duplicate records removed before screening (n = 6). Screening: records screened (n = 54); records excluded (n = 29); reports assessed for eligibility (n = 24); reports excluded: review did not cover relevant topics (n = 7), keyword extraction/identification (n = 8), connection identification/summarization (n = 2). Included: studies included in review (n = 7).]

During the initial step of screening, 29 studies were
excluded since either the title or the content of the abstract did not cover the desired topics
described in the subsection “study selection,” and one study could not be retrieved. In the
next step, 24 studies were fully read and assessed for eligibility, where seven reviews did
not cover relevant topics, and eight studies reported only keyword/symptom identification
or extraction without resulting in any form of text summaries. Additionally, two studies
were excluded because they summarized connections in the form of tabular summaries or
identified such connections. Thus, the selection process yielded seven studies that were relevant to
the review process.
3.1.1 Characteristics of the included studies
All included studies were published as scientific articles, either as part of conference
proceedings (4) or in journals (3). One study apiece appeared in 2017, 2018, and 2020; the
remaining studies were published in 2022. Almost half of the included studies were conducted in
the USA (42%); the remaining studies were based in Europe (Table 7.1). Four studies included
extractive text summarization in their research, while three studies dealt with abstractive
summarization.
Table 7.1 Characteristics of the included studies.

Authors | Year | Country | Category | Data source | Type of text summarization | Type of learning | Goal
Sadeh-Sharvit et al. | 2022 | USA | Retrospective study | Clinical session | Extractive | N/A | Examine the use of session summaries
Trivedi et al. | 2018 | USA | Research article | Signout notes | Extractive | Semi-supervised | New tool proposal for signout note preparation
Lee and Uppal | 2019 | USA | Research article | Clinical/biomedical text | Extractive | Multistage process: TF-IDF, random forests and indicators of importance | Proposing CERC system built
Goldstein et al. | 2017 | Israel | Research article | Clinical records | Abstractive | Few-shot | Examine automated creation of …
In the selected studies, researchers carried out evaluations of generated summaries using
various techniques. Lee et al. evaluated generated texts using ROUGE metrics, in
particular ROUGE-1, ROUGE-2, and ROUGE-SU4, comparing the generated text with the
reference text according to overlapping unigrams, bigrams, and skip-bigrams with a
maximum skip distance of 4, respectively. The ROUGE metric is one of the most widely
used evaluation techniques in general (Lin & Och, 2004; Yan et al., 2011).
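ROUGE overlap can be computed with Google's open-source rouge-score package; a brief sketch for ROUGE-1 and ROUGE-2 with invented texts (the package does not implement the skip-bigram ROUGE-SU4 variant mentioned above):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
reference = "Physical activity improves outcomes in fatty liver disease."
generated = "Exercise improves fatty liver disease outcomes."
for name, score in scorer.score(reference, generated).items():
    # Each score holds unigram/bigram overlap precision, recall, and F1.
    print(f"{name}: P={score.precision:.2f} R={score.recall:.2f} F1={score.fmeasure:.2f}")
```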
The study of Reunamo et al. reported using a manual evaluation scale of four classes that
evaluates the quality of information being conveyed (Reunamo et al., 2022). Similarly,
Kocbek et al. used “Informativeness” as one of the evaluation measures (Kocbek et al.,
2022).
Measures like “readability,” “comprehensiveness,” “relevance,” “naturalness,” and
“quality” were each applied only once across all seven studies: “readability” and
“comprehensiveness” were included in the study of Goldstein et al., while, among others,
“naturalness” and “quality” were present in the study of Kocbek et al. (2022) and van der
Lee et al. (2021).
The assessment criteria can also encompass other quality measures, such as clinical course
and continuity of care. The clinical course measures the effectiveness of the text in
enabling clinicians to understand the events and experiences of the patient throughout their
ICU stay, while the continuity of care measures the extent to which the information
provided in the text supports the seamless continuation of care (Goldstein et al., 2017).
The first example is from Reunamo et al. and is called an explainer extractor, which
extracts keywords and key phrases through explainable AI; more specifically, it combines a
text classification model (a bidirectional LSTM) with model explainability (local
interpretable model-agnostic explanations [LIME]) (Reunamo et al., 2022). An example
can be seen in Fig. 7.2: though it might not provide full sentences, it does provide
a reasonable explanation and could be turned into sentences with advanced
transformer-based methods. The first part shows the keywords that were recognized as the
coefficients of LimeTextExplainer with the highest absolute values. Coefficients were then
Z-standardized, weighted by paragraph score, and colored accordingly. In the second
part, keywords that appeared consecutively were merged together to form key phrases, and
the key phrase scores were determined based on the highest scores of the individual
components. The last step shows the result after the removal of stop words and duplicate
keywords.

Figure 7.2
Word scoring example of a paragraph.

Figure 7.3
Qualitative evaluation of short summaries with annotations (1). From Stiglic, G., Musovic, K., Gosak, L., Fijacko, N., & Kocbek, P. (2022). Relevance of automated generated short summaries of scientific abstract: use case scenario in healthcare. In Proceedings - 2022 IEEE 10th International Conference on Healthcare Informatics, ICHI 2022 (pp. 599–605). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICHI54592.2022.00118.
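The keyword-scoring step can be sketched with the lime package: LimeTextExplainer perturbs the input text and fits a local linear model whose coefficients score each word's contribution to the classifier's prediction. The toy classifier and texts below are stand-ins; Reunamo et al. used a bidirectional LSTM trained on nursing entries:

```python
from lime.lime_text import LimeTextExplainer  # pip install lime
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in classifier trained on toy data for illustration only.
texts = ["patient mobilized twice, pain controlled", "wound infected, fever persists"]
labels = [0, 1]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

explainer = LimeTextExplainer(class_names=["routine", "concern"])
exp = explainer.explain_instance("fever persists after wound care",
                                 clf.predict_proba, num_features=4)
# (word, coefficient) pairs; the highest absolute values become keywords.
print(exp.as_list())
```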
Next, we look at a qualitative example from the abstract by Stine et al. (2021) and
compare the results of four different LLMs (Fig. 7.3). Semantic Scholar summarized by
extracting information directly from the study conclusions. OpenAI Davinci summarized the
abstract into a short statement without any supportive information. Interestingly, OpenAI
Curie included a sentence referencing the guidelines of the American College of Sports
Medicine (ACSM), which is not even mentioned in the original abstract, while completely
ignoring nonalcoholic fatty liver disease (NAFLD). Like Semantic Scholar, PEGASUS-Xsum
generated a short summary by extracting content from the results.
Another example of qualitative evaluation is based on the abstract by
Kanellopoulou et al. (2021) (Fig. 7.4). Here, Semantic Scholar (SS_tldr_baseline)
generated a short summary mainly from the objective and conclusions. While in the
previous example OpenAI Davinci provided a general statement, this time it extracted
the beginning of the results. Similarly to the previous example, OpenAI Curie
(OpenAI_tldr_Curie) added additional information to the generated short summary;
this time it clearly expressed an opinion by adding the following sentence: “I’m not sure
what to make of this study.” PEGASUS-Xsum summarized the abstract in a very general
manner, while BART-SAMSum extracted a part of the results without any supporting
numbers.
Below we provide two examples of improvements in short summarization using GPT-4
compared to GPT-3 and older models (Fig. 7.5): in the first case, the improvement seems
minor (Fig. 7.5A), whereas in the second case, there is a major improvement (Fig. 7.5B).
In the example of the study by Spaulding et al. (2021) (Fig. 7.5A), Semantic Scholar and
GPT-4 relied on an extractive summarization technique. While Semantic Scholar
provided a summary directly from the conclusions, GPT-4 extended that part by
expanding an abbreviated term and pointing out the type of study that was conducted. One of
the LLMs, OpenAI Davinci, included a sentence that resembles a reviewer comment, while
PEGASUS-Xsum supplied false information. In another example (Fig. 7.5B), based
on the study by Gutierrez et al. (2021), OpenAI Davinci, Curie, and PEGASUS-Xsum
achieved much lower average relevance scores than Semantic Scholar, whereas GPT-4
enriched and improved the summarization in comparison to the GPT-3 models.
Models generate repetitive and generic summaries, especially when the input text is very long;
summarization of very long documents presents the problem of delivering coherent
summaries (Gupta & Siddiqui, 2012). At times, models can be misled by metaphors or
idiomatic expressions and change the meaning of the text during the summarization process.
Another issue is that these models are pretrained on English text data and might not
perform well when summarizing text in other languages (Aksenov et al.,
2020).
Figure 7.4
Qualitative evaluation of short summaries with annotations (2). From Kocbek, P., Gosak, L., Musovic, K., & Stiglic, G. (2022). Generating Extremely Short Summaries from the Scientific Literature to Support Decisions in Primary Healthcare: A Human Evaluation Study. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13263, pp. 373–382). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-09342-5_37.

Figure 7.5
Examples of qualitative improvements in short summarization using GPT-4 (March 2023 version) with annotations. [GPT-4 output shown in the figure: “A study conducted on Latinx patients at Caridad Center in Florida found that multi-morbidity is prevalent among those suffering from depression and/or anxiety. The majority of the patients were female, and many had multiple chronic conditions, such as hypertension, hyperlipidemia, and type 2 diabetes. The study highlights the need for further investigation into how lifestyle and other risk factors affect the severity of mental illness in this population.” *FRE: 30–50, difficult to read, best understood by college graduates.]
Content from images can be extracted using various methods, such as optical character
recognition (OCR), which extracts text from images (Patel et al., 2012), or image
captioning models, which can generate descriptions of images. Features such as
relationships or patterns are usually not interpretable and understandable by the model,
which can affect the quality of the generated summary (Fan et al., 2018).
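A minimal OCR sketch using the pytesseract wrapper (it assumes the Tesseract binary is installed and that a figure.png file exists):

```python
from PIL import Image  # pip install pillow pytesseract
import pytesseract  # also requires the Tesseract OCR binary on the system

# Extract raw text from an image, e.g., a figure embedded in a PDF.
text = pytesseract.image_to_string(Image.open("figure.png"))
print(text)
```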
One of the main limitations is the scalability of human evaluations, both in terms of time
and cost, yet they are still favored in current trends and best practice guidelines (van der Lee
et al., 2021).
4. Discussion
This review explores the findings and implications of the research conducted on the
application of SOTA models in text summarization within the healthcare domain. It delves
into the various approaches and techniques adopted by researchers.
This review observed that only a few studies utilized models such as those based
on GPT-3 for text summarization in the healthcare domain. Moreover, most
included studies dealt with extractive summarization. Consequently, there is a lack
of in-depth exploration of abstractive summarization, which provides paraphrased, concise,
and coherent summaries. Studies included in this review focus on generating summaries
from clinical and biomedical texts, speech-to-text transcripts, and nursing entries in EHRs.
Based on that, there is potential for research dealing with radiology reports, pathology
reports, and similar documents. Another area that could be focused on is establishing
standardized evaluation metrics for assessing the effectiveness of methods for generating
short summaries, since studies so far have mentioned scattered metrics such as readability,
completeness, naturalness, quality, and informativeness. Summary evaluation in healthcare
lags behind the novel evaluation approaches proposed for LLMs. In general, studies evaluate
summaries on limited, single aspects (fluency, readability, etc.) or on multiple aspects while
neglecting to define them, let alone the relationships among them. Moreover, such evaluations
require time-consuming manual annotation of samples. These issues were addressed by Fu
et al., who proposed GPTScore, a novel framework that utilizes NLP instructions (zero-shot
instructions) and in-context learning to overcome multiple evaluation challenges
simultaneously (Fu et al., 2023). Another approach to summary evaluation was introduced
by Liu et al., who proposed G-Eval, which uses LLMs with chain-of-thought (CoT)
prompting (Liu et al., 2023; Wei et al., 2022). It requires a prompt with the definition of
the evaluation task and criteria, together with a collection of intermediary instructions (the
CoT) generated by the LLM that delineate the precise sequence of evaluation steps (Liu
et al., 2023). The difference in scoring between the two is that GPTScore uses the
probability of the generated text as the evaluation metric, whereas G-Eval performs the
evaluation with a form-filling paradigm.
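The form-filling idea behind G-Eval can be sketched as a prompt template that asks the LLM to return a numeric score per criterion; the wording below is illustrative rather than the paper's exact prompt:

```python
GEVAL_STYLE_PROMPT = """You will evaluate a summary against its source text.

Evaluation criteria:
Coherence (1-5): the summary should be well structured and well organized.

Evaluation steps:
1. Read the source text and identify its main points.
2. Check whether the summary presents them in a clear, logical order.
3. Assign a coherence score from 1 to 5.

Source: {source}
Summary: {summary}

Coherence (1-5):"""

prompt = GEVAL_STYLE_PROMPT.format(source="...", summary="...")
# Send `prompt` to an LLM and parse the returned number as the score.
```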
The use of NLP for short text generation has not reached its peak potential. Due to the
known issues, it is still unclear when human evaluation could be replaced by an LLM.
Acknowledgments
This project is supported by the European Union’s Horizon Europe research and innovation programme under
the Grant Agreement No 101159018. Views and opinions expressed are however those of the author(s) only and
do not necessarily reflect those of the European Union or the European Research Executive Agency (REA).
Neither the European Union nor the granting authority can be held responsible for them. The authors also
acknowledge partial support from the Slovenian Research Agency (grant numbers N3-0307 and P2-0057).
References
Agrawal, M., Hegselmann, S., & Lang, H. (2022). Large language models are few-shot clinical information
extractors.
Aksenov, D., Moreno-Schneider, J., Bourgonje, P., Schwarzenberg, R., Hennig, L., & Rehm, G. (2020).
Abstractive text summarization based on language model conditioning and locality modeling. In LREC
2020 - 12th international conference on language resources and evaluation, conference proceedings (pp.
6680–6689). Germany: European Language Resources Association (ELRA).
Arksey, H., & O’Malley, L. (2005). Scoping studies: Towards a methodological framework. International
Journal of Social Research Methodology: Theory and Practice, 8(1), 19–32. https://doi.org/10.1080/1364557032000119616
Aydın, Ö., & Karaarslan, E. (2022). OpenAI ChatGPT generated literature review: Digital twin in healthcare.
SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4308687
Bahdanau, D., Cho, K. H., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and
translate. In 3rd international conference on learning representations, ICLR 2015 - conference track
proceedings. Germany: International Conference on Learning Representations, ICLR. https://dblp.org/db/
conf/iclr/iclr2015.html.
Bataa, E., & Wu, J. (2020). An investigation of transfer learning-based sentiment analysis in Japanese. In ACL
2019 - 57th annual meeting of the association for computational linguistics, proceedings of the conference
(pp. 4652–4657). Japan: Association for Computational Linguistics (ACL).
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P.,
Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A.,
Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. arXiv.
https://arxiv.org.
Cachola, I., Lo, K., Cohan, A., & Weld, D. S. (2020). TLDR: Extreme summarization of scientific documents.
In Findings of the association for computational Linguistics findings of ACL: Emnlp 2020 (pp.
4766–4777). United States: Association for Computational Linguistics (ACL). https://aclanthology.org/
events/emnlp-2020/#2020-findings-emnlp.
Chintagunta, B., Katariya, N., Amatriain, X., & Kannan, A. (2021). Medically aware GPT-3 as a data generator
for medical dialogue summarization. arXiv. https://arxiv.org.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional
transformers for language understanding. In NAACL HLT 2019 - 2019 conference of the North American
chapter of the association for computational Linguistics: Human language technologies - proceedings of
the conference (Vol. 1, pp. 4171–4186). Association for Computational Linguistics (ACL).
Edmundson, H. P. (1969). New methods in automatic extracting. Journal of the ACM, 16(2), 264–285. https://doi.org/10.1145/321510.321519
Erkan, G., & Radev, D. R. (2011). LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22, 457–479. https://doi.org/10.1613/jair.1523
Fan, C., Zhang, Z., & Crandall, D. J. (2018). Deepdiary: Lifelogging image captioning and summarization. Journal of Visual Communication and Image Representation, 55, 40–55. https://doi.org/10.1016/j.jvcir.2018.05.008
Fu, J., Ng, S.-K., Jiang, Z., & Liu, P. (2023). GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
Fum, D., Guida, G., & Lasso, C. (1986). Tailoring importance evaluation to reader’s goals: A contribution to
descriptive text summarization.
Goldstein, A., & Shahar, Y. (2013). Implementation of a system for intelligent summarization of longitudinal clinical records. In Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in Bioinformatics) (Vol. 8268, pp. 68–82). https://doi.org/10.1007/978-3-319-03916-9_6. Israel.
Goldstein, A., Shahar, Y., Orenbuch, E., & Cohen, M. J. (2017). Evaluation of an automated knowledge-based
textual summarization system for longitudinal clinical data, in the intensive care domain. Artificial
Intelligence in Medicine, 82, 20–33. https://doi.org/10.1016/j.artmed.2017.09.001
Gupta, S., & Gupta, S. K. (2019). Abstractive summarization: An overview of the state of the art. Expert
Systems with Applications, 121, 49–65. https://doi.org/10.1016/j.eswa.2018.12.011
Gupta, V. K., & Siddiqui, T. J. (2012). Multi-document summarization using sentence clustering. In 4th
international conference on intelligent human computer interaction: Advancing technology for humanity,
IHCI 2012. https://doi.org/10.1109/IHCI.2012.6481826. India.
Gutierrez, M., Diaz-Martinez, J., Maisonet, J., Ayala, J., Pierre, A. J., Kallus, L., & Martinez, S. (2021). Co-
Morbid health conditions in latinx adults receiving care for depression and anxiety. Current Developments
in Nutrition, 5, 128. https://doi.org/10.1093/cdn/nzab035_036
Hannousse, A. (2021). Searching relevant papers for software engineering secondary studies: Semantic Scholar
coverage and identification role. IET Software, 15(1), 126–146. https://doi.org/10.1049/sfw2.12011
Jeffrey, A. R., & Rewon, W. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1.
Jiao, W., Wang, W., Huang, J. T., Wang, X., & Tu, Z. (2023). Is ChatGPT A good translator? Yes with GPT-4
as the engine. arXiv. https://doi.org/10.48550/arXiv.2301.08745
Kanellopoulou, A., Katelari, A., Notara, V., Antonogeorgos, G., Rojas-Gil, A. P., Kornilaki, E. N., Kosti, R. I.,
Lagiou, A., & Panagiotakos, D. B. (2021). Parental health status in relation to the nutrition literacy level
of their children: Results from an epidemiological study in 1728 Greek students. Mediterranean Journal
of Nutrition and Metabolism, 14(1), 57–67. https://doi.org/10.3233/MNM-200470
Kasai, J., Sakaguchi, K., Le Bras, R., Dunagan, L., Morrison, J., Fabbri, A. R., Choi, Y., & Smith, N. A.
(2022). Bidimensional leaderboards: Generate and evaluate language hand in hand. In NAACL 2022 -
2022 conference of the North American chapter of the association for computational linguistics: Human
language technologies, proceedings of the conference. United States: Association for Computational
Linguistics (ACL). https://aclanthology.org/events/naacl-2022/#2022-naacl-main.
Kocbek, P., Gosak, L., Musovic, K., & Stiglic, G. (2022). Generating extremely short summaries from the
scientific literature to support Decisions in primary healthcare: A human evaluation study. In Lecture notes
in computer science (including subseries lecture notes in artificial intelligence and lecture notes in
Bioinformatics) (Vol. 13263, pp. 373–382). Slovenia: Springer Science and Business Media Deutschland
GmbH. https://doi.org/10.1007/978-3-031-09342-5_37
Kojima, T., Reid, M., Matsuo, Y., Gu, S. S., & Iwasawa, Y. (2022). Large Language models are zero-shot
reasoners. arXiv. https://doi.org/10.48550/arXiv.2205.11916
Korngiebel, D. M., & Mooney, S. D. (2021). Considering the possibilities and pitfalls of Generative Pre-trained
Transformer 3 (GPT-3) in healthcare delivery. Npj Digital Medicine, 4(1). https://doi.org/10.1038/s41746-021-00464-x
Lee, E. K., & Uppal, K. (2020). Cerc: An interactive content extraction, recognition, and construction tool for
clinical and biomedical text. BMC Medical Informatics and Decision Making, 20. https://doi.org/10.1186/s12911-020-01330-8
Lin, C.-Y., & Och, F. J. (2004). Looking for a few good metrics: ROUGE and its evaluation. In Kando
N. & Ishikawa H. (Eds.), Proceedings of the Fourth NTCIR Workshop on Research in Information Access
Technologies Information Retrieval, Question Answering and Summarization, 4, Tokyo, Japan: National
Institute of Informatics. https://research.nii.ac.jp/ntcir/workshop/OnlineProceedings4/OPEN/NTCIR4-OPEN-LinCY.pdf.
Liu, Y., Iter, D., & Xu, Y. (2023). GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development,
2(2), 159–165. https://doi.org/10.1147/rd.22.0159
Mihalcea, R. (2004). Graph-based ranking algorithms for sentence extraction, applied to text summarization. In
Proceedings of the annual meeting of the association for computational linguistics (Vol. 2004). United States: Association for Computational Linguistics (ACL). https://aclweb.org/.
Models. (2023). OpenAI API. https://platform.openai.com/docs/models/overview. (Accessed 21 June 2023).
Moradi, M., Blagec, K., Haberl, F., & Samwald, M. (2021). GPT-3 models are poor few-shot learners in the
biomedical domain. arXiv. https://doi.org/10.48550/arXiv.2109.02555
New models and developer products announced at DevDay. (2023). https://openai.com/blog/new-models-and-
developer-products-announced-at-devday.
Nomoto, T., & Matsumoto, Y. (2001). A new approach to unsupervised text summarization. In SIGIR forum
(ACM special interest group on information retrieval) (pp. 26–34). Japan: Association for Computing
Machinery (ACM). https://doi.org/10.1145/383952.383956
OpenAI. (2022). https://openai.com/. (Accessed 1 December 2022).
OpenAI. (2023). https://openai.com/.
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L.,
Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A.,
Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., … Moher, D. (2022). The PRISMA
2020 statement: An updated guideline for reporting systematic reviews. Revista Panamericana de Salud
Publica/Pan American Journal of Public Health, 46. https://doi.org/10.26633/RPSP.2022.112
Patel, C., Patel, A., & Patel, D. (2012). Optical character recognition by open source OCR tool tesseract: A
case study. International Journal of Computer Applications, 55(10), 50–56. https://doi.org/10.5120/8794-2784
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by
generative pre-training. OpenAI Blog. https://cdn.openai.com/research-covers/language-unsupervised/
language_understanding_paper.pdf.
Reunamo, A., Peltonen, L. M., Mustonen, R., Saari, M., Salakoski, T., Salanterä, S., & Moen, H. (2022). Text
classification model explainability for keyword extraction-towards keyword-based summarization of
nursing care episodes. Studies in Health Technology and Informatics, 290, 632–636. https://doi.org/10.3233/SHTI220154
Romera-Paredes, B., & Torr, P. H. S. (2015). An embarrassingly simple approach to zero-shot learning. In
32nd international conference on machine learning, ICML 2015 (Vol. 3, pp. 2142–2151). United
Kingdom: International Machine Learning Society (IMLS).
Sadeh-Sharvit, S., Rego, S. A., Jefroykin, S., Peretz, G., & Kupershmidt, T. (2022). A comparison between
clinical guidelines and real-world treatment data in examining the use of session summaries: Retrospective
study. JMIR Formative Research, 6(8), e39846. https://doi.org/10.2196/39846
Spaulding, E. M., Marvel, F. A., Piasecki, R. J., Martin, S. S., & Allen, J. K. (2021). User engagement with
smartphone apps and cardiovascular disease risk factor outcomes: Systematic review. JMIR Cardio, 5(1).
https://doi.org/10.2196/18834
Sreelekha, S., Bhattacharyya, P., Jha, S. K., & Malathi, D. (2016). A survey report on evolution of machine
translation. International Journal of Control Theory and Applications, 9(33), 233–240.
Stiglic, G., Musovic, K., Gosak, L., Fijacko, N., & Kocbek, P. (2022). Relevance of automated generated short
summaries of scientific abstract: Use case scenario in healthcare. In Proceedings - 2022 IEEE 10th international conference on healthcare informatics, ICHI 2022 (pp. 599–605). Slovenia: Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICHI54592.2022.00118
Stine, J. G., Soriano, C., Schreibman, I., Rivas, G., Hummer, B., Yoo, E., Schmitz, K., & Sciamanna, C.
(2021). Breaking down barriers to physical activity in patients with nonalcoholic fatty liver disease.
Digestive Diseases and Sciences, 66(10), 3604–3611. https://doi.org/10.1007/s10620-020-06673-w
Suanmali, L., Salim, N., & Binwahlan, M. S. (2009). Fuzzy logic based method for improving text
summarization. International Journal of Computer Science and Information Security (IJCSIS), 2.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances
in Neural Information Processing Systems, 4(January), 3104–3112.
Trivedi, G., Handzel, R., Visweswaran, S., Chapman, W., & Hochheiser, H. (2018). An interactive NLP tool
for signout note preparation. In Proceedings - 2018 IEEE international conference on healthcare
informatics, ICHI 2018 (pp. 426–428). United States: Institute of Electrical and Electronics Engineers
Inc. https://doi.org/10.1109/ICHI.2018.00084
van der Lee, C., Gatt, A., van Miltenburg, E., & Krahmer, E. (2021). Human evaluation of automatically
generated text: Current trends and best practice guidelines. Computer Speech & Language, 67, 101151.
https://doi.org/10.1016/j.csl.2020.101151
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I.
(2017). Attention is all you need. In Advances in neural information processing systems (Vol. 2017, pp.
5999–6009). United States: Neural information processing systems foundation.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022).
Chain-of-Thought prompting elicits reasoning in large Language Models. arXiv. https://doi.org/10.48550/arXiv.2201.11903
Yan, R., Kong, L., Huang, C., Wan, X., Li, X., & Zhang, Y. (2011). Timeline generation through evolutionary
trans-temporal summarization. In EMNLP 2011 - conference on empirical methods in natural language
processing, proceedings of the conference (pp. 433–443) (China).
Zhang, H., Cai, J., Xu, J., & Wang, J. (2019). Pretraining-based natural language generation for text
summarization. In CoNLL 2019 - 23rd conference on computational natural language learning,
proceedings of the conference (pp. 789–797). China: Association for Computational Linguistics.
Zhang, J., Oh, Y. J., Lange, P., Yu, Z., & Fukuoka, Y. (2020). Artificial intelligence chatbot behavior change
model for designing artificial intelligence chatbots to promote physical activity and a healthy diet:
Viewpoint. Journal of Medical Internet Research, 22(9), e22845. https://doi.org/10.2196/22845
Zhang, J., Zhao, Y., Saleh, M., & Liu, P. J. (2020). Pegasus: Pre-Training with extracted gap-sentences for abstractive summarization. In 37th international conference on machine learning, ICML 2020 (pp. 11265–11276). United Kingdom: International Machine Learning Society (IMLS).