
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 27, NO. 12, DECEMBER 2023

Large AI Models in Health Informatics: Applications, Challenges, and the Future

Jianing Qiu, Lin Li, Jiankai Sun, Graduate Student Member, IEEE, Jiachuan Peng, Peilun Shi, Ruiyang Zhang, Yinzhao Dong, Kyle Lam, Frank P.-W. Lo, Bo Xiao, Wu Yuan, Senior Member, IEEE, Ningli Wang, Dong Xu, Member, IEEE, and Benny Lo, Senior Member, IEEE

Abstract—Large AI models, or foundation models, are models recently emerging with massive scales both parameter-wise and data-wise, the magnitudes of which can reach beyond billions. Once pretrained, large AI models demonstrate impressive performance in various downstream tasks. A prime example is ChatGPT, whose capability has compelled people's imagination about the far-reaching influence that large AI models can have and their potential to transform different domains of our lives. In health informatics, the advent of large AI models has brought new paradigms for the design of methodologies. The scale of multi-modal data in the biomedical and health domain has been ever-expanding, especially since the community embraced the era of deep learning, which provides the ground to develop, validate, and advance large AI models for breakthroughs in health-related areas. This article presents a comprehensive review of large AI models, from background to their applications. We identify seven key sectors in which large AI models are applicable and might have substantial influence, including: 1) bioinformatics; 2) medical diagnosis; 3) medical imaging; 4) medical informatics; 5) medical education; 6) public health; and 7) medical robotics. We examine their challenges, followed by a critical discussion about potential future directions and pitfalls of large AI models in transforming the field of health informatics.

Index Terms—Artificial intelligence, bioinformatics, biomedicine, deep learning, foundation model, health informatics, healthcare, medical imaging.

Manuscript received 21 March 2023; revised 3 August 2023; accepted 8 September 2023. Date of publication 22 September 2023; date of current version 6 December 2023. This work was supported in part by the Research Grants Council (RGC) of Hong Kong SAR under Grants ECS24211020, GRF14203821, and GRF14216222, in part by the Innovation and Technology Fund (ITF) of Hong Kong SAR under Grant ITS/240/21, in part by the Science, Technology and Innovation Commission (STIC) of Shenzhen Municipality under Grant SGDX20220530111005039, and in part by the Bill & Melinda Gates Foundation under Grant OPP1171395. (Corresponding authors: Wu Yuan; Benny Lo.)

Jianing Qiu was with Precision Robotics (Hong Kong) Ltd., Hong Kong. He is now with the Department of Computing, Imperial College London, SW7 2AZ London, U.K., and also with the Department of Biomedical Engineering, The Chinese University of Hong Kong, Hong Kong (e-mail: [email protected]).

Lin Li is with the Department of Informatics, King's College London, WC2R 2LS London, U.K. (e-mail: [email protected]).

Jiankai Sun is with the School of Engineering, Stanford University, Stanford, CA 94305 USA (e-mail: [email protected]).

Jiachuan Peng is with the Department of Engineering Science, University of Oxford, OX1 2JD Oxford, U.K. (e-mail: [email protected]).

Peilun Shi and Wu Yuan are with the Department of Biomedical Engineering, The Chinese University of Hong Kong, Hong Kong (e-mail: [email protected]; [email protected]).

Ruiyang Zhang is with Precision Robotics (Hong Kong) Ltd., Hong Kong (e-mail: [email protected]).

Yinzhao Dong is with the Faculty of Engineering, The University of Hong Kong, Hong Kong (e-mail: [email protected]).

Kyle Lam is with the Department of Surgery and Cancer, Imperial College London, SW7 2AZ London, U.K. (e-mail: [email protected]).

Frank P.-W. Lo and Bo Xiao are with the Hamlyn Centre for Robotic Surgery, Imperial College London, SW7 2AZ London, U.K. (e-mail: [email protected]; [email protected]).

Ningli Wang is with the Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Beijing 100054, China, and also with Beijing Ophthalmology & Visual Sciences Key Laboratory, Beijing 100005, China (e-mail: [email protected]).

Dong Xu is with the Department of Electrical Engineering, University of Missouri, Columbia, MO 65211 USA, and also with the Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211 USA (e-mail: [email protected]).

Benny Lo is with the Faculty of Medicine, Imperial College London, SW7 2AZ London, U.K., and also with Precision Robotics (Hong Kong) Ltd., Hong Kong (e-mail: [email protected]).

Digital Object Identifier 10.1109/JBHI.2023.3316750

© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

THE introduction of ChatGPT [1] has triggered a new wave of development and deployment of Large AI Models (LAMs) recently. As shown in Fig. 1, ChatGPT and the phenomenal Segment Anything Model (SAM) [2] have sparked active research in medical and health sectors since their initial launch. Although these models are groundbreaking, the AI community had in fact started creating LAMs much earlier, and it was the seminal work introducing the Transformer model [3] back in 2017 that accelerated the creation of LAMs.

The recent advances in data science and AI algorithms have endowed LAMs with strengthened generative and reasoning capabilities, as well as generalist intelligence across multiple tasks with impressive zero- and few-shot performance, significantly distinguishing them from early deep models. For example, when asked for medical advice, ChatGPT, based on GPT-4 [4], demonstrates the capability of recalling prior conversation and contextualizing the user's past medical history before answering, showing a new level of intelligence well beyond that of a simple symptom checker [5].

One notable bottleneck of developing supervised medical and clinical AI models is that they require annotated data at scale for training a well-functioning model. However, such annotations have to be conducted by domain experts, which is often expensive and time-consuming. This causes the curation of large-scale

medical and clinical data with high-quality annotations to be challenging. However, this may no longer be a bottleneck for LAMs, as they can leverage self-supervision and reinforcement learning in training, relieving the annotation burden and workload of curating large-scale annotated datasets [6]. With the ever-increasing proliferation of the medical Internet of Things such as pervasive wearable sensors, medical and clinical history such as electronic health records (EHRs), prevalent medical imaging for diagnosis such as computed tomography (CT) scans, the growing genomic sequence discovery, and more, the abundance of biomedical, clinical, and health data fosters the development of the next generation of AI models in the field, which are expected to have a large capacity for modeling the complexity and magnitude of health-related data, and to generalize to multiple unseen scenarios to actively assist and engage in clinical and medical decision-making.

Fig. 1. Number of publications related to ChatGPT and SAM in medical and health areas. Statistics were queried from Google Scholar with the keywords "Medical ChatGPT" or "Medical Segment Anything", and the last entry was 31 Aug. 2023. From April to August, each month, there were over 200 publications about ChatGPT in medicine and healthcare.

Despite the homogeneity of the model architecture (current LAMs are primarily based on the Transformer [3]), LAMs are inherently strong learners of heterogeneous data due to their large capacity, unified input modeling of different modalities, and improved multi-modal learning techniques. Multi-modality is common in biomedical and health settings, and the multi-modal nature of health data provides a natural and promising ground for developing and evaluating LAMs.

The LAMs that this article discusses are mainly foundation models [7]. However, this article also provides a retrospective of recent LAMs that are not necessarily considered foundational at their current stage, but are seminal in advancing the future development of LAMs in the fields of biomedicine and health informatics. Fig. 2 summarizes the key features of LAMs and highlights the paradigm shift they are introducing, i.e., 1) large-scale model size; 2) large-scale training/pre-training; and 3) large generalization.

Albeit inspirational, LAMs still face challenges and limitations, and the rapid rise of LAMs brings new opportunities as well as potential pitfalls. This article aims to provide a comprehensive review of the recent developments of LAMs, with a particular focus on their impacts on the biomedical and health informatics communities. The remainder of this article is organized as follows: Section II describes the background of LAMs in general domains, such as natural language processing (NLP) and computer vision (CV); Section III discusses current progress and possible applications of LAMs in key sectors of health informatics; Section IV discusses challenges, limitations, and risks of LAMs; Section V points out some potential future directions for advancing LAMs in health informatics; and Section VI concludes.

As this field progresses very rapidly, and also due to the page limit, there are many works that this article cannot cover. It is our hope that the community can stay updated with the latest advances, so we refer readers to our website (https://github.com/Jianing-Qiu/Awesome-Healthcare-Foundation-Models) for the latest progress on LAMs.

II. BACKGROUND OF LARGE AI MODELS

The burgeoning AI community has devoted much effort to developing large AI models (LAMs) in recent years by leveraging the massive influx of data and computational resources. Based on the pre-training data modality, this article categorizes current LAMs into three types and defines them as follows:

1) Large Language Model (LLM): LLMs are pre-trained on language data and applied to language downstream tasks. Language in different settings can have different interpretations, e.g., protein is the language of life, and code is the language of computers.

2) Large Vision Model (LVM): LVMs are pre-trained on vision data and applied to vision downstream tasks.

3) Large Multi-modal Model (LMM): LMMs are pre-trained on multi-modal data, e.g., language and vision data, and applied to various single- or multi-modal downstream tasks.

This section provides an overview of the background of these three types of LAMs in general domains.

A. Large Language Models

The proposal of the Transformer architecture [3] heralded the start of the development of large language models (LLMs) in the field of NLP. Since 2018, following the birth of GPT (Generative Pre-trained Transformer) [8] and BERT (Bidirectional Encoder Representations from Transformers) [9], the development of LLMs has progressed rapidly.

Broadly speaking, recent LLMs [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21] have three distinct characteristics: 1) parameter-wise, the number of learnable parameters of an LLM can easily be scaled up to billions; 2) data-wise, a large volume of unlabelled data is used to pre-train an LLM, and the amount can often reach millions or billions of samples, if not more; 3) paradigm-wise, LLMs are first pre-trained, often with weakly- or self-supervised learning (e.g., masked language modeling [9] and next-token prediction [4]), and then fine-tuned or adapted to various downstream tasks, such as question answering and dialogue, in which they are able to demonstrate impressive performance.

Recent advances reveal that LLMs are impressive zero-shot, one-shot, and few-shot learners. They are able to extract, summarize, translate, and generate textual information with only a few or even no prompt/fine-tuning samples [4].

Fig. 2. Key features of large AI models lie in the following four aspects: 1) increased size (e.g., for large language models (LLMs), the number
of parameters is often billions); 2) trained with large-scale data (e.g., for LLMs, the data can contain trillions of tokens; and for large vision models
(LVMs), the data can contain billions of images); 3) able to process data of multiple modalities; and 4) can perform well across multiple downstream
tasks, especially on zero-, one-, and few-shot tasks.

Furthermore, LLMs manifest impressive reasoning capability, and this capability can be further strengthened with prompt engineering techniques such as Chain-of-Thought prompting [22].

There was an upsurge in the number of new LLMs from 2022 onwards. There is a general consensus that scaling up the number of parameters and the amount of data will lead to improved performance, which has led to a dominant trend of developing LLMs with billions of parameters (e.g., PaLM [13] already contains 540 billion parameters) and even trillions of data tokens (e.g., LLaMA 2 was pre-trained with 2 trillion tokens [11], and the training data of RETRO [23] had over 5 trillion tokens). However, there is currently no concerted agreement within the community on whether this continuous growth of model and data size is optimal [10], [14], and a verified universal scaling law is still lacking.

To balance the data annotation cost and efficacy, as well as to train an LLM that better aligns with human intent, researchers have commonly used reinforcement learning from human feedback (RLHF) [24] to develop LLMs that exhibit desired behaviors. The core idea of RLHF is to use human preference datasets to train a Reward Model (RM) that approximates the reward function; the language model is then optimized against this reward with RL algorithms (e.g., Proximal Policy Optimization (PPO) [25]). The RLHF framework has attracted much attention and become a key component of many LLMs, such as InstructGPT [19], Sparrow [26], and ChatGPT [1]. Recently, Susano Pinto et al. [27] have also investigated this reward optimization in vision tasks, which can possibly advance the development of future LVMs using RLHF.

B. Large Vision Models

In computer vision, it has been common practice for years to first pre-train a model on a large-scale dataset and then fine-tune it on the dataset of interest (usually smaller than the one used for pre-training) for improved generalization [28]. The fundamental changes driving the evolution of large vision models lie in the scale of pre-training datasets and models, and in the pre-training methods. ImageNet-1K (1.28M images) [29] and ImageNet-21K (14M images) [30] used to be the canonical datasets for visual pre-training. ImageNet is manually curated for high-quality labeling, but the prohibitive cost of curation severely hindered further scaling. To push the scale beyond ImageNet's, datasets like JFT (300M [31] and 3B [32] images) and IG (3.5B [33] and 3.6B [34] images) were collected from the web with little or no curation. The quality of annotation is therefore compromised, and the accessibility of these datasets is limited because of copyright issues.

The compromised annotation requires the pre-training paradigm to shift from supervised learning to weakly-/self-supervised or unsupervised learning. The latter methods include autoregressive modeling, generative modeling, and contrastive learning. Autoregressive modeling trains the model to autoregressively predict the next pixel conditioned on the preceding pixels [35]. Generative modeling trains the model to reconstruct the entire original image, or some target regions within it [36], from its corrupted [37] or masked [38] variants. Contrastive learning trains the model to discriminate similar and/or dissimilar data instances [39].
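The contrastive objective can be illustrated with a toy InfoNCE loss: each item's two views (e.g., two augmentations of one image, or an image and its caption) form a positive pair, and all other cross-pairings in the batch serve as negatives. The 2-D embeddings and the temperature below are invented purely for illustration; real systems operate on high-dimensional learned embeddings.

```python
import math

def info_nce(pairs, temperature=0.1):
    """Average InfoNCE loss over a batch of (view_a, view_b) embedding pairs."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

    n = len(pairs)
    loss = 0.0
    for i in range(n):
        # Similarity of view_a[i] to every view_b in the batch; entry i is the positive.
        logits = [cos(pairs[i][0], pairs[j][1]) / temperature for j in range(n)]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_z)  # cross-entropy with the true match at index i
    return loss / n

# Correctly matched views yield a much lower loss than mismatched ones.
aligned = [([1.0, 0.0], [0.9, 0.1]), ([0.0, 1.0], [0.1, 0.9])]
shuffled = [([1.0, 0.0], [0.1, 0.9]), ([0.0, 1.0], [0.9, 0.1])]
assert info_nce(aligned) < info_nce(shuffled)
```

Minimizing this loss pulls matched views together and pushes mismatched ones apart, which is the mechanism that CLIP-style models (Section II-C) scale up to hundreds of millions of image-text pairs.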

Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) are the two major architectural families of LVMs. Among vision transformers, the pioneering works ViT [40] and iGPT [35] transferred transformer architectures from NLP to CV with minimal modification, but the resulting architectures incur high computational complexity, which is quadratic in the image size. Later, works like TNT [41] and Swin Transformer [42] were proposed to better adapt transformers to visual data. Recently, ViT-G/14 [32], SwinV2-G [43], and ViT-22B [44] substantially scaled vision transformers up using a bag of training tricks to achieve state-of-the-art (SOTA) accuracy on various benchmarks. While ViTs may seem to have gained more momentum than CNNs in the development of LVMs, the latest works on improving CNNs, such as ConvNeXt [45] and InternImage [46], redesigned the CNN architecture with inspiration from ViTs and achieved SOTA accuracy on ImageNet. This refutes the previous claim that CNNs are inferior to ViTs. Apart from the above, recent works like CoAtNet [47] and ConViT [48] merge CNNs and ViTs to form new hybrid architectures. Note that ViT-22B is the largest vision model to date; its scale is significantly larger than that of the current art of CNNs (InternImage, 1.08B parameters), but still much smaller than that of contemporary LLMs.

Architecturally speaking, LVMs are largely scaled-up variants of their base architectures. How they are scaled up can significantly impact the final performance. Simply increasing the depth by repeating layers vertically may be suboptimal [49], so a line of studies [46], [50], [51] investigates rules for effective scaling. Furthermore, scaling up the model size is usually combined with larger-scale pre-training [49], [52] and efficient parallelism [53] for improved performance.

LVMs also transform fundamental computer vision tasks beyond classification. The latest breakthrough in segmentation is SAM [2]. SAM is built with a ViT-H image encoder (632M parameters), a prompt encoder, and a transformer-based mask decoder that predicts object masks from the outputs of the two encoders. Prompts can be points or bounding boxes in images, or text. SAM demonstrates a remarkable zero-shot generalization ability to segment unseen objects and images. Furthermore, to train SAM, the largest segmentation dataset to date, SA-1B, with over 1B masks, was constructed.

C. Large Multi-Modal Models

This section describes large multi-modal models (LMMs). While the primary focus is on one type of LMM, large vision-language models (LVLMs), multi-modality beyond vision and language is also summarized at the end.

Training LVLMs like CLIP [54] requires hundreds of millions of image-text pairs or more. Such large amounts of data were often closed-source [54], [55], [56]. Only recently was LAION-5B [57] created, with 5.85B data samples, matching the size of the largest private dataset while being available to the public.

LVLMs usually adopt a dual-stream architecture: input text and image are processed separately by their respective encoders to extract features. For representation learning, the features from different modalities are then aligned through contrastive learning [54], [55], [56] or fused into a unified representation through another encoder on top of all extracted features [58], [59], [60], [61], [62]. Typically, the entire model, including the unimodal encoders and the multi-modal encoder if present, is pre-trained on the aforementioned large-scale image-text datasets, and then fine-tuned on downstream tasks or used to carry out zero-shot tasks without fine-tuning. The pre-training objectives can be multi-modal tasks only, or combined with unimodal tasks (see Section II-B). Common multi-modal pre-training tasks include image-text contrastive learning [54], [55], [56], image-text matching [58], [61], [63], autoregressive modeling [59], [62], masked modeling [58], image-grounded text generation [61], [63], etc. Recent studies suggest that scaling up the unimodal encoders [60], [64] and pre-training with multiple objectives across uni- and multi-modalities [61], [63] can substantially benefit multi-modal representation learning.

Recently, LVLMs have made a major breakthrough in text-to-image generation. There are generally two classes of methods for this task: autoregressive models [65], [66], [67] and diffusion models [67], [68], [69], [70]. An autoregressive model, as introduced in Section II-B, first concatenates the tokens (returned by some encoders) of the text and images together and then learns a model to predict the next item in the sequence. In contrast, a diffusion model first perturbs an image with random noise progressively until the image becomes pure noise (the forward diffusion process) and then learns a model to gradually denoise the completely noisy image to restore the original image (the reverse diffusion process) [71]. The text description is first encoded by a separate encoder and then integrated into the reverse diffusion process as input to the model, so that image generation can be conditioned on the text prompt. It is common to reuse pre-defined LLM and LVM architectures and/or their pre-trained parameters as the aforementioned encoders. The scale of these encoders and of the generator can significantly impact the quality of generation and the ability of language understanding [66], [70].

The paradigm of bridging language and vision modalities can go beyond learning, e.g., using an LLM to instruct other LVMs to perform vision-language tasks [72]. Beyond vision and language, recent developments in LMMs seek to unify more modalities under one single framework, e.g., ImageBind [73] combines six, whereas Meta-Transformer [74] unifies twelve modalities.

III. APPLICATIONS OF LARGE AI MODELS IN HEALTH INFORMATICS

In this section, we identify seven key sectors in which LAMs will have substantial influence and bring a new paradigm for tackling the problems and challenges in health informatics. The seven key sectors include 1) bioinformatics; 2) medical diagnosis; 3) medical imaging; 4) medical informatics; 5) medical education; 6) public health; and 7) medical robotics. Table I compares current LAMs with previous SOTA methods in these seven sectors.

TABLE I
COMPARISON BETWEEN STATE-OF-THE-ART LAMS (SECOND ROW) AND PRIOR ARTS (FIRST ROW) IN TYPICAL TASKS OF SEVEN BIOMEDICAL AND HEALTH SECTORS

A. Bioinformatics

Molecular biology studies the roles of biological macromolecules (e.g., DNA, RNA, and proteins) in life processes and describes various life activities and phenomena, including the structure, function, and synthesis of molecules. Although many experimental attempts have been made on this topic over decades [75], [76], [77], they still suffer from high cost, long experiment cycles, and high production difficulty. For example, the number of experimentally determined protein structures stored

in the protein data bank (PDB) hardly rivals the number of protein sequences that have been generated. Efficient and accurate computational methods are therefore needed and can be used to accelerate the protein structure determination process. Due to their huge number of parameters and learning capacity, LAMs offer the prospect of approaching such a Herculean task. In particular, LLMs' outstanding representation learning ability has been employed to implicitly model the biological properties hidden in large-scale unlabeled data, including RNA and protein sequences.

In the field of proteins, starting from amino acid sequences, we can analyze the spatial structure of proteins and, furthermore, understand their functions and mutual interactions. AlphaFold2 [78] pioneered leveraging the attention-based Transformer model [3] to predict protein structures. Specifically, it treated structure prediction as a 3D graph inference problem, where the network's inputs are pairwise features between residues, available templates, and multi-sequence alignment (MSA) embeddings. In particular, embeddings extracted from the MSA can capture the evolutionary information between aligned sequences. The Evoformer and structure modules were proposed to update the input representation and predict the final 3D structure, and the whole process was recycled several times. Meanwhile, despite being trained on single protein chains, AlphaFold2 exhibits the ability to predict multimers. To further enable multimeric inputs for training, DeepMind proposed AlphaFold-Multimer [79], achieving impressive performance, especially in the structure prediction of heteromeric protein complexes. Specifically, the positional encoding was improved to encode chains, and multi-chain MSAs were paired based on species annotations and target sequence similarity.

In spite of the groundbreaking contributions of the aforementioned works, to achieve optimal predictions they still heavily rely on MSAs and templates searched from

genetic and structure databases, which is time-consuming. Analogous to mining semantic information in natural language, researchers have managed to explore co-evolution information in protein sequences in a self-supervised manner by employing large-scale protein language models (PLMs), which learn the global relations and long-range dependencies of unaligned and unlabelled protein sequences. ProGen (1.2B) [80] utilized a conditional language model to provide controllable generation of protein sequences. By inputting desired tags (e.g., function, organism), ProGen can generate corresponding proteins such as enzymes with good functional activity. Elnaggar et al. [81] devised ProtT5-XXL (11B), which was first trained on BFD [82] and then fine-tuned on UniRef50 [83] to predict secondary structure. ESMFold [84] scaled the number of model parameters up to 15B and observed a significant prediction improvement over AlphaFold2 (TM-score of 0.68 vs. 0.38 on CASP14) with considerably faster inference when MSAs and templates are unavailable. Similarly, from only the primary sequence input, OmegaFold [85] can outperform MSA-based methods [78], [86], especially when predicting orphan proteins, which are characterized by a paucity of homologous structures. xTrimoPGLM [87] proposed a unified pre-training strategy that integrates protein understanding and generation by optimizing masked language modelling and general language modelling concurrently, and achieved remarkable performance over 13 diverse protein tasks with its 100B parameters. For instance, for GB1 fitness prediction in the protein function task, xTrimoPGLM outperforms the previous SOTA method, Ankh [88], with an 11% performance increase. Moreover, for antibody structure prediction, xTrimoPGLM outperformed AlphaFold2 (TM-score: 0.951) and achieved SOTA performance (TM-score: 0.961) with significantly faster inference. We underscore that in the presence of MSAs, although the performance of PLMs is hardly on par with AlphaFold2, PLMs can make predictions several orders of magnitude faster, which speeds up related applications such as drug discovery. In addition, because PLMs implicitly capture the deep information implied in protein sequences, they are promising for predicting mutations in protein structures and their potential impact, helping to guide the design of next-generation vaccines.

In the context of RNA structure prediction, the number of nonredundant 3D RNA structures stored in the PDB is significantly smaller than that of protein structures, which hinders accurate and generalizable prediction of RNA structure from sequence information using deep learning. To mitigate the severe unavailability of labeled RNA data, Chen et al. [89] proposed the RNA foundation model (RNA-FM), which learns evolutionary information implicitly from 23 million unlabeled ncRNA sequences [90] by recovering masked nucleotide tokens, to facilitate multiple downstream tasks including RNA secondary structure prediction and 3D closeness prediction. In particular, for secondary structure prediction, RNA-FM achieves a 3-5% performance increase across three metrics (i.e., Precision, Recall, and F1-score) compared to UFold [91], which utilizes a U-Net as the backbone. Furthermore, based on RNA-FM, Shen et al. [92] pioneered predicting 3D RNA structure directly.

Undoubtedly, these models are seminal and have reduced the time and cost of molecular structure prediction by a large margin. This raises the question of whether LAMs can completely replace experimental methods such as Cryo-EM [75]. We deem that they still fall short of that point. Specifically, the advance of LAMs builds upon Big Data and large model capacity, which means they are still data-driven, and hence their ability to predict unseen types of data can be problematic. For instance, [93] stated that AlphaFold can barely handle the effect of missense mutations on protein structure due to the lack of a corresponding dataset. Furthermore, how we can assess the quality of model predictions for unknown protein structures remains unclear. In turn, these unverified protein structures cannot be applied to, for example, drug discovery. Therefore, protocols and metrics need to be established to assess their quality and potential impacts. There are mutual and complementary benefits between LAMs and conventional experimental techniques. LAMs can be re-designed to predict the process of protein folding and reveal mutual interactions so as to facilitate experimental methods. On the other hand, experimental information, such as physical properties of molecules, can be leveraged by LAMs to further improve prediction performance, especially when dealing with rare data (e.g., orphan proteins).

B. Medical Diagnosis

As research has been carried out to improve the safety and strengthen the factual grounding of LAMs, it is foreseeable that LAMs will play a significant role in medical diagnosis and decision-making.

CheXzero [94], a zero-shot chest X-ray classifier, has demonstrated radiologist-level performance in classifying multiple pathologies that it never saw during its self-supervised learning process. Recently, ChatCAD [95], a framework that integrates multiple diagnostic networks with ChatGPT, demonstrated a potential use case for applying LLMs in computer-aided diagnosis (CAD) for medical images. By stratifying the decision-making process with specialized medical networks, followed by an iteration of prompts based on the outcomes of those networks as queries to an LLM for medical recommendations, the workflow of ChatCAD offers an insight into integrating LLMs pre-trained on massive corpora with upstream specialized diagnostic networks to support medical diagnosis and decision-making. Its follow-up work, ChatCAD+ [96], shows improved quality of generated diagnostic reports with the incorporation of a retrieval system. Using external knowledge and information retrieval can potentially make the resulting diagnostics more factually grounded, and such a design has also been favoured and implemented in the ChatDoctor model [97]. By leveraging a linear transformation layer to align two medical LAMs, XrayGPT [98], a conversational chest X-ray diagnostic tool, shows decent accuracy in producing diagnostic summaries. While most LLMs are based on English, researchers have also managed to fine-tune LLaMA [10], an LLM, with Chinese medical knowledge, and the resulting model shows improved medical expertise in Chinese [99].

Apart from chest X-ray diagnostics and medical question answering, LAMs have also been applied to other diagnostic scenarios. HeartBEiT [100], a foundation model pre-trained using 8.5 million electrocardiograms (ECGs), shows that large-scale ECG pre-training can produce accurate cardiac diagnoses with improved explainability of the diagnostic outcome, while reducing the amount of annotated data needed for downstream fine-tuning. Medical LAMs may also potentially produce more reliable forecasts of treatment outcomes and of the future development of diseases using their strong reasoning capability. For example, Li et al. [101] proposed BEHRT, which is able to predict the most likely disease of a patient in his/her next visit by learning from a large archive of EHRs. Rasmy et al. [102] proposed Med-BERT, which is able to predict the heart failure of diabetic patients.
6080 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 27, NO. 12, DECEMBER 2023
predict the most likely disease of a patient in his/her next visit by learning from a large archive of EHRs. Rasmy et al. [102] proposed Med-BERT, which is able to predict the heart failure of diabetic patients.

With the ubiquity of the internet, medical LAMs can also offer remote diagnosis and medical consultation for people at home, providing people in need with more flexibility. We also envision that future diagnosis of complex diseases may be conducted or assisted by a panel of clinical LAMs.

C. Medical Imaging

The adoption of medical imaging and vision techniques has vastly influenced the process of diagnosis and treatment of a patient. The wide use of medical imaging, such as CT and MRI, has produced a vast amount of multi-modal, multi-source, and multi-organ medical vision data to accelerate the development of medical vision LAMs.

The recent success of SAM [2] has drawn much attention within the medical imaging community. SAM has been extensively examined in medical imaging, especially regarding its zero-shot segmentation ability. Research revealed that for certain medical imaging modalities and targets, the zero-shot performance of SAM is impressive (e.g., on endoscopic and dermoscopic images, as these are essentially RGB images, the same type as SAM's pre-training images); however, for imaging modalities that are medicine-specific, such as MRI and OCT, SAM often fails to segment targets in a zero-shot way [103], mainly because the topology and presentation of a target in those imaging modalities are much different from what SAM has seen during pre-training. Nevertheless, after adaptation and fine-tuning, the medical segmentation accuracy of SAM can surpass the current SOTA by a clear margin [104], showing the potential of extending the versatility of general LVMs to medical imaging with parameter-efficient adaptation. Apart from zero-shot segmentation, MedCLIP [105] was proposed, a contrastive learning framework for decoupled medical images and text, which demonstrated impressive zero-shot medical image classification accuracy. In particular, it yielded over 80% accuracy in detecting Covid-19 infection in a zero-shot setting. The recent PLIP model [106], built using image-text pairs curated from medical Twitter, enables both image-based and text-based pathology image retrieval, as well as enhanced zero-shot pathology image classification compared to CLIP [54].

Many medical imaging modalities are 3-dimensional (3D), and thus developing 3D medical LVMs is crucial. Med3D [107], a heterogeneous 3D framework that enables pre-training on multi-domain medical vision datasets, shows strong generalization capabilities in downstream tasks, such as lung segmentation and pulmonary nodule classification.

With the success of generative LAMs such as Stable Diffusion [68] in the general domain, which can generate realistic high-fidelity images from text descriptions, Chambon et al. [108] recently fine-tuned Stable Diffusion on medical data to generate synthetic chest X-ray images based on clinical descriptions. The encouraging generative capability of Stable Diffusion in the medical domain may inspire more future research on using generative LAMs to augment medical data that are conventionally hard to obtain and expensive to annotate.

Nevertheless, some compromises are also evident in medical vision LAMs. For example, the currently common practice of training LVMs and LMMs often limits the size of the medical images to shorten the training time and reduce the computational costs. The reduced size inevitably causes information loss, e.g., some small lesions that are critical for accurate recognition might be removed in a compressed downsampled medical image, whereas doctors could examine the original high-resolution image and spot these early-stage tumors. This may cause performance discrepancies between current medical vision LAMs and well-trained doctors. In addition, although research has shown that increasing medical LAM size and data size could improve the medical domain performance of the model, e.g., STU-Net [109], a medical segmentation model with 1.4 billion parameters, the best practice of model-data scaling is yet to be conclusive in medical imaging and vision.

D. Medical Informatics

In medical informatics, it has been a topic of long-standing interest to leverage large-scale medical information and signals to create AI models that can recognize, summarize, and generate medical and clinical content.

Over the past few years, with advances in the development of LLMs [9], [13], [110], and the abundance of EHRs as well as public medical text outlets such as PubMed [111], [112], research has been carried out to design and propose Biomedical LLMs. Since the introduction of BioBERT [113], a seminal Biomedical LLM which outperformed previous SOTA methods on various biomedical text mining tasks such as biomedical named entity recognition, many different Biomedical LLMs that stem from their general LLM counterparts have been proposed, including ClinicalBERT [114], BioMegatron [115], BioMedRoBERTa [116], Med-BERT [102], BioELECTRA [117], PubMedBERT [118], BioLinkBERT [119], BioGPT [120], and Med-PaLM [121].

The recent GatorTron [122] model (8.9 billion parameters), pre-trained with de-identified clinical text (82 billion words), revealed that scaling up the size of clinical LLMs leads to improvements on different medical language tasks, and the improvements are more substantial for complex ones, such as medical question answering and inference. Previously, the PubMedBERT work [118] also suggested that pre-training an LLM with biomedical corpora from scratch can lead to better results than continually training an LLM that has been pre-trained on general-domain corpora. While training a large number of parameters may seem daunting, parameter-efficient adaptation techniques such as low-rank adaptation (LoRA) [123] have enabled researchers to efficiently adapt a 13-billion-parameter LLaMA model to produce decent US Medical Licensing Exam (USMLE) answers, and the performance of a collection of such fine-tuned models, called MedAlpaca [124], also reveals that increasing model size and data quality can improve a model's medical domain expertise. As LLMs start to show emergent abilities [125] as their size is scaled up, Agrawal et al. [126] revealed that recent LLMs such as InstructGPT [19] and GPT-3 [110] can extract clinical information well in a few-shot setting despite not being explicitly trained for the clinical domain. Med-PaLM [121], a Biomedical LLM with 540 billion parameters generated by applying instruction prompt tuning on Flan-PaLM [12] (which exhibited SOTA accuracy on MultiMedQA [121]), demonstrated the ability to answer consumer medical questions in a manner comparable to the performance of clinicians. Its follow-up work, Med-PaLM 2 [127], further strengthens medical reasoning, and as shown in Table I, it has reached an accuracy of 86.5% on
QIU et al.: LARGE AI MODELS IN HEALTH INFORMATICS 6081
the MedQA benchmark. As prompt engineering has become a key technique for investigating and improving LLMs, Liévin et al. [128] have also applied various prompt engineering methods to the GPT-3.5 series, such as InstructGPT [19], to understand their abilities on medical question answering, and their results suggested that increasing the number of Chain-of-Thoughts (CoTs) [22] per question can deliver better, more interpretable medical question responses.

The impressive performance of Biomedical LLMs on medical language tasks shows their potential to assist clinicians in processing, interpreting, and analyzing clinical and medical data more efficiently, and also to vastly reduce the time that clinicians have to spend on documenting EHRs. Patel and Lam [129] recently shed insight on using ChatGPT [1] to generate discharge summaries, which could potentially relieve doctors from laborious writing and improve their clinical productivity. Biomedical LLMs can also assist in the writing of prior authorizations for insurance purposes, accelerating treatment authorizations [130].

On the patient side, the zero-, one-, and few-shot learning capability of LLMs may enable them to provide personalized medical assistance based on the medical history of each individual patient. In addition, LLMs may also find applications in clinical trial matching. Based on candidates' demographics and medical history, a Biomedical LLM may effectively generate eligible matchings, which accelerates clinical trial recruitment and initiation.

E. Medical Education

It is likely that future medical education will also be influenced by LAMs, as research continues to strengthen their scientific grounding and creative generation. Many LAMs, such as GPT-4 [4] and Med-PaLM 2 [127], have already passed the USMLE with a score of over 86%, demonstrating a sound knowledge spectrum and reasonable capabilities in bioethics, clinical reasoning, and medical management.

The generative capability of such LAMs may augment medical students' learning and help them gain additional insights from AI-generated content, as recently pointed out in [131]. A LAM with wide knowledge and social compliance can act as a companion learning assistant, answering medical questions promptly and explaining intricate terms and practices in simple sentences. For example, the recent GPT-4 model [4] can act as a Socratic tutor, leading a student step-by-step to find the answers by themselves, which is an important step towards practical adoption of LAMs in education as they can be steered to teach/assist students in a desired manner. The OPTICAL model proposed by Shue et al. [132] recently shows the feasibility of using LLMs to guide beginners in analyzing bioinformatics data. The sentence-paraphrasing abilities of LLMs [133] such as ChatGPT may also help students with dyslexia in their learning.

However, concerns about illegitimate uses of LAMs, such as plagiarism, are practical and should raise awareness. A pilot study conducted by Mitchell et al. [134] proposed a zero-shot detector named DetectGPT, which is able to distinguish human-written from LLM-generated text. This attempt may lead to more research into developing reliable tools for verifying the content source and potentially countering the side effects of LAMs in education.

For medical education givers, LAMs can potentially create novel teaching and exam content, and diversify teaching formats and their presentation. Based on the history of medical study and outcomes, LAMs may also help design personalized and precise course materials for students in need. In addition, LAMs may also help deliver remote medical education, providing engaging learning experiences and opportunities for students living in resource-poor areas or from underprivileged families. LAMs can also serve as a grading and scoring system in medical education, e.g., grading the surgical skill of a surgeon operating a surgical robot.

In medical and clinical training, such as nurse training, one can imagine a domain-knowledgeable LAM acting as an assistant or a trainer to supervise the training. For certain frequent and tedious routine medical training courses, human trainers tend to become less productive as training keeps repeating, and the quality of training delivery also varies among different human trainers. With a wide knowledge spectrum and responsive interactions, training delivered by a LAM can potentially be more engaging and productive, and the standard of training can be maintained as equal and of high quality.

F. Public Health

As the American epidemiologist Larry Brilliant said, "outbreaks are inevitable, but pandemics are optional". With the world gradually returning to normal after the Covid-19 pandemic, if there is one thing that the world has to reflect on, it is how we become prepared to prevent the next pandemic.

Based on past public policy and interventions to contain the spread of infectious diseases and the specific current situation, LLMs may help epidemiologists and policymakers to draft targeted public policies and recommend effective interventions. LLMs and other LAMs are also likely to be used to monitor, track, forecast, and analyze the progress of new outbreaks. LAMs have been actively researched for drug discovery, e.g., the Pangu Drug model [135], and they can potentially be used for the design of vaccines and drugs to treat and save people from new outbreaks. Furthermore, another potential usage of LAMs, as pointed out in [136], can be in precision triage and diagnosis, in which they could play a pivotal role as the medical care workforce might be stretched when encountering a new outbreak. An important aspect of tackling an outbreak/epidemic is handling misinformation. The study conducted by Chen et al. [137] revealed that from 21 January 2020 to 21 March 2020, Twitter produced over 72 million Covid-19-related tweets. If unverified media information proliferates at scale, it inevitably causes complications in tackling the outbreak. Although LAMs could be double-edged swords when it comes to misinformation, with gradually completed regulations and strengthened factual grounding of LAMs, they can be used to effectively identify misinformation and tackle the public health infodemic.

Beyond their promising usage in preventing pandemics, LAMs are also an effective tool for solving other public health challenges, for example, providing large-scale dietary monitoring and assessment [138], [139] to tackle the growing double burden of malnutrition [140] in many low- and middle-income countries, and demystifying and proposing new solutions for mental illnesses that are common in populations. Researchers have recently proposed ClimaX [141], a foundation model for forecasting weather and climate change. With their remarkable forecasting capability, LAMs like ClimaX and Pangu-Weather [142] can advance our understanding of climate change and provide solutions to better address the global health issues posed by climate change.
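The detection criterion behind DetectGPT, mentioned earlier in this section, can be sketched in a few lines: model-generated text tends to sit near a local maximum of the model's log-likelihood, so perturbing it lowers the likelihood more than perturbing human-written text does. The snippet below is a deliberately crude, self-contained stand-in. The real method [134] scores text with the LLM's own log-likelihood and perturbs it with a mask-filling model such as T5; here a toy character-level unigram model plays both roles, and the two example strings are illustrative assumptions.

```python
import math
import random

# Toy "LLM": a fixed character-unigram model that favors common characters.
MODEL_PROBS = {ch: 0.09 for ch in "etaoins hr"}

def log_likelihood(text):
    # Average per-character log-probability under the toy model.
    return sum(math.log(MODEL_PROBS.get(ch, 1e-6)) for ch in text) / len(text)

def perturb(text, rng, rate=0.15):
    # Crude stand-in for T5 mask-filling: randomly rewrite some characters.
    chars = list(text)
    for i in range(len(chars)):
        if rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)

def detect_score(text, n_perturb=50, seed=0):
    # DetectGPT statistic: log p(text) minus the mean log p of perturbations.
    # Larger values suggest the text sits near a likelihood peak, i.e., that
    # it was generated by the scoring model.
    rng = random.Random(seed)
    base = log_likelihood(text)
    perturbed = [log_likelihood(perturb(text, rng)) for _ in range(n_perturb)]
    return base - sum(perturbed) / len(perturbed)

machine_like = "the rain is near"   # high likelihood under the toy model
human_like = "quick jazz volume"    # contains characters the toy model finds rare

score_m = detect_score(machine_like)
score_h = detect_score(human_like)
print(score_m > score_h)  # the "machine-like" text scores higher
```

The design choice worth noting is that the statistic needs no labeled training data: it only compares the likelihood of a text against the likelihood of its own perturbed copies, which is why the paper describes it as a zero-shot detector.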
G. Medical Robotics

From surgical robots that allow surgeons to perform precision minimally invasive surgery, to wearable robots that assist patients with health monitoring and rehabilitation, medical robotics has seen rapid growth and advances over the past few decades. LAMs have begun to show exciting prospects in enhancing medical robotic vision, interaction, and autonomy.

1) Enhance Vision: The integration of LAMs into surgical robots has the potential to enhance the vision of these systems in surgery. Endo-FM [143], a foundation model with high precision for endoscopic video classification, segmentation, and detection, could be one of these LAMs to provide robotic surgery systems with enhanced vision. In addition to online vision enhancement, LAMs can also potentially improve the offline workflow analysis of robotic surgery, and more accurately and objectively predict the likelihood of complications and successful outcomes, which can help surgeons better plan and execute surgeries in the future. Furthermore, with their strong generative capabilities, LAMs can be used to generate and simulate surgical procedures, allowing surgeons to practice and refine their techniques before operating on a patient with real surgical robots. Beyond surgical robots, the perception of many companion and assistive robots can also be enhanced by LAMs, e.g., enabling a companion robot to better understand a patient's emotions through accurate recognition of facial expressions [144], and enabling an assistive robot to offer safer, more natural navigation for visually impaired people [145].

2) Improve Interaction: LAMs may significantly improve the interactive capabilities of many medical robots by enabling them to recognize human emotions, gestures, and speech, and respond to high-level human language commands. For example, this will make it easier for patients undergoing rehabilitation to communicate and engage with their robotic assistants, improving their overall recovery experience. More intelligent LAMs may also better understand human intentions and create more human-like companionship, which could improve the overall quality of care for the elderly [146]. Recently, SurgicalGPT [147], a visual question answering model for surgery, has shown great promise that future robotic surgery could become more interactive between surgeons and surgical robots.

3) Increase Autonomy: LAMs have the potential to turn robotic pipelines from the current engineer-in-the-loop to user-in-the-loop using high-level language commands [148], which could enable surgeons with less programming proficiency to easily adapt robotic manipulations to their target tasks. Studies have proposed using a single LAM to conduct diverse robotic tasks, demonstrating impressive adaptability and generalization skills [149], [150], [151], [152], [153], [154], [155]. These advancements can potentially inspire the development of more autonomous medical robots.

IV. CHALLENGES, LIMITATIONS, AND RISKS

Despite the promising outcomes of LAMs, there remain many challenges and potential risks in developing and deploying LAMs in biomedical, clinical, and healthcare applications.

1) Data: Most existing public datasets for health informatics are much smaller (please refer to Fig. 2 and Table II²) than those used in general domains and are thus likely insufficient to unlock the full potential of LAMs in biomedical and health scenarios. Building large-scale, high-quality medical datasets is particularly challenging because 1) curation requires domain expertise to identify data of clinical relevance, and quality assurance is very important with health data; 2) some data modalities, like MRI, require special devices to collect, which is inefficient and expensive; and 3) the collected data may not be allowed to be published or used for training because of consent, legal, and privacy issues. Furthermore, the RLHF training strategy of some LLMs like ChatGPT requires even more intense engagement of human experts.

² References to the datasets in Table II and methods in Table I can be found in https://github.com/Jianing-Qiu/Awesome-Healthcare-Foundation-Models.

2) Computation: Training, or even fine-tuning, contemporary LAMs is extremely expensive in terms of time and resource consumption, which is beyond the budget of most researchers and organizations [156]. Taking LLaMA, an LLM with 65B parameters, as an example, it took about 21 days on 2048 A100 GPUs to train the model once on a dataset of 1.4T tokens [10]. Furthermore, even inference can be prohibitively costly due to the model size, making it impractical for most hospitals to deploy these LAMs locally using the computing devices at hand.

3) Reliability: The reliability threshold for translation into clinical practice is significantly higher [157]. Despite their impressive performance, LLMs are still far from reliable [158] and are prone to hallucinate [4], [159], i.e., generating factually incorrect yet plausible content which misleads users. In addition, the unsatisfactory robustness of LAMs impairs their credibility. LLMs are known to be sensitive to prompts [160]. LLMs, as well as LAMs for other modalities, remain vulnerable to out-of-distribution and adversarial examples [158], [161]. Improving the robustness of LAMs may require even more data [162]. Therefore, caution is highly required when using LAMs in healthcare practice to alleviate the potential danger of over-reliance. In addition, LAMs, especially LLMs, were trained offline, whereas in many clinical and health scenarios, using up-to-date information is critical.

4) Privacy: First, LAMs have been reported to have excessive capacity to memorize their training data [163], and more importantly, it is viable to extract sensitive information in the memorized data using direct prompts [163], [164]. This was later mitigated by fine-tuning LAMs to refuse to answer such prompts [165]. However, Li et al. [165] also show that this mitigation can be bypassed through tricky prompts called jailbreaking. Moreover, membership inference attacks [166] could reveal whether a sample is in the training set, e.g., whether a patient is in a cancer dataset. This has recently been demonstrated to work even on the latest large diffusion models [167]. Second, the information provided by users to query LLM-integrated applications may be leaked. According to the data policy of OpenAI [168], they store the data that users provide to ChatGPT or DALL-E to train their models. Unfortunately, it has been reported that the stored personal information can be leaked incidentally by a "chat history" bug [169] or deliberately by an indirect prompt injection attack [170].

5) Fairness: LAMs are data-driven approaches, so they could learn any bias from the training data. Unfortunately, bias widely exists in the delivery of healthcare [171] and also in the data collected in this process [172], [173], [174]. Machine learning models trained on such data are reported to mimic human bias against race [172], gender [173], politics [175], etc. In addition to these conventional biases, LLMs present language bias as well, i.e., they perform better in particular languages like English but
TABLE II
LARGE-SCALE DATASETS IN BIOMEDICAL AND HEALTH INFORMATICS
worse in others [176] because training data is dominated by a few languages.

6) Toxicity: Current LAMs, even LLMs explicitly trained with alignment, do not understand and represent human ethics [177]. LLMs are reported to produce hate speech [178] that causes offensive and psychologically harmful content and even incites violence. Secondly, LAMs may endorse unethical or harmful views and behaviors [177] and motivate users to act on them. Lastly, LAMs can be used intentionally to facilitate harmful activities like spreading disinformation and encouraging criminal activities. Although some countermeasures like filtering are applied, they can be circumvented by prompt injection [176].

7) Transparency: Recently, some impactful LAMs like ChatGPT and Med-PaLM 2 chose not to disclose their complete technical details, pre-trained models, and the data used. This makes it impossible for others to independently reproduce, improve upon, and audit these methods. This transparency threat for LAMs can be more serious in healthcare, as much medical data is private and models built upon it are not allowed to be open-sourced.

8) Interpretability: LAMs inherently lack interpretability due to their extremely dense hidden layers. Even worse, the behavior of LAMs can be meaningless [179], [180] and hard to predict [181], [182], and thus mysterious. For example, DALL-E 2 generates images of physical objects from absurd prompts (e.g., "Apoploe vesrreaitais" for birds) [179]; the reasoning ability of LLMs can be improved by simply adding the text "Let's think step by step" to prompts [181]. There has been little progress towards explaining LAMs. Chain-of-thought prompts provide one way to reveal the intermediate reasoning steps behind an output, but it remains unclear whether the generated description of reasoning reflects the model's true internal reasoning. Alternatively, mechanistic interpretability methods [182] reverse-engineer the computation of LAMs to illuminate the model's internal mechanism of reasoning.

9) Sustainability: Despite many benefits, LAMs, if abused, will negatively impact the sustainability of our society. LAMs consume a lot of computation resources [10] and energy [183] and emit tons of carbon [183] in all activities of their lifecycle (from training to deployment) because of their scale. For example, as estimated by [184], training a GPT-3 model consumes 1287 MWh and emits 552 tons of CO2. As the paradigm moves towards LAMs in healthcare, more and more research is expected to be conducted based on LAMs, which could be environmentally unfriendly due to the cost and carbon emissions if right practices [183] are not established.

10) Regulation: Regulation is needed to ensure responsible LAMs, especially when some of the above issues cannot be technically addressed. Particularly, data collection and usage should be governed to protect the rights of data owners, such as copyright, privacy, and "being forgotten" [185]. The liability of LAMs' creators/owners for the possible harm caused by a model's output should be clarified. LAMs should be deployed in critical healthcare services only if regulatory approval is obtained and a standardized safety assessment is passed. Regulation today is much behind the development of technology for LAMs, even for more general AI. Our webpage lists some major legislation for reference if readers want to know more about AI regulation.
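The GPT-3 sustainability estimate quoted above (1287 MWh and 552 tons of CO2 per training run, as estimated by [184]) implies a concrete carbon intensity that is easy to sanity-check. The short calculation below uses only those two quoted figures; any per-query or per-deployment cost would require additional assumptions not made here.

```python
# Back-of-the-envelope check of the training-cost estimate quoted above:
# 1287 MWh of energy and 552 metric tons of CO2 for one GPT-3 training run.
energy_mwh = 1287
co2_tons = 552

energy_kwh = energy_mwh * 1000        # 1 MWh = 1,000 kWh
co2_grams = co2_tons * 1_000_000      # 1 metric ton = 1,000,000 g

# Implied carbon intensity of the electricity used during training.
intensity_g_per_kwh = co2_grams / energy_kwh
print(f"{intensity_g_per_kwh:.0f} g CO2 per kWh")
```

The implied figure of roughly 429 gCO2/kWh is in the vicinity of typical fossil-heavy grid averages, which is one reason the "right practices" cited above (e.g., running training on low-carbon electricity) can change the footprint substantially.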
a solution or improvement to the tasks in a nearly cost-free


way as it requires no further large-scale training. Prompt en-
gineering [187] as an emerging field is an effective approach to
discovering hidden capabilities.

B. Responsibility
Responsible LAMs for social good is paramount [188].
We suggest two complementary strategies: development and
deployment, for future work to tackle challenges in LAM re-
liability, fairness, transparency, and beyond. Development strat-
egy focuses on learning responsible LAMs, while deployment
strategy emphasizes using LAMs responsibly.
Technically, responsible LAMs can be developed through two
Fig. 3. Future directions of LAMs in health informatics. perspectives: data and algorithms. As LAMs learn bias from
training data, an intuitive countermeasure is thus to mitigate bias
in training data. It can be done by filtering biased data [189],
V. FUTURE DIRECTIONS increasing underrepresented populations’ data, etc. Unfortu-
nately, how to efficiently inspect large-scale datasets remains
In this section, we discuss some promising directions for
challenging. Besides, training data should encompass diverse
future work to advance LAMs in the field of biomedical and
distributions to robustify the model against the distribution shifts
health informatics, and our discussion below is mainly focused
in the wild, and human preference to align the model with human
on two aspects (Fig. 3): capability and responsibility.
values. Some of them like human preference must be collected
from human activities, while the rest can be also generated by
A. Capability other algorithms like data augmentation and generative models.
The first is to develop new LAMs for health informatics Once we have high-quality data, algorithms like RLHF and
with better capability. The better capability here refers to either adversarial training can be adopted to exploit these data to
new abilities (e.g., a versatile medical task solver) or improved acquire the desired properties for responsible LAMs.
existing abilities (e.g., higher diagnostic accuracy), compared In addition to training LAMs to be responsible, it is also vital
to the prior paradigm. Interestingly, some emergent new abil- to use LAMs in a responsible way. Efforts should be made to
ities may be unexpected or even unknown to humans [125]. educate the users, especially those use LAMs for critical health-
Among numerous approaches to LAMs, some are perceived care services, about the basics and limitations of LAMs being
by us as most promising. Scaling up the size of dataset and used. Human-LAM partnership should also be researched for
model are two widely recognized approaches, but how to do the effective, efficient and responsible use of LAMs, including
it efficiently is of importance and far from solved. Further- how to query/instruct LAMs by prompt engineering and as-
more, pre-training with varied tasks and modalities has achieved sess/adopt the responses from LAMs. Besides, a comprehensive
remarkable progress towards versatility in performing down- verification framework [190] covering various desired properties
stream tasks. A huge benefit is foreseeable if diverse knowledge for LAMs is critical for assessing how irresponsible a LAM
that exists in these varied tasks (e.g., biology, medicine, etc.) and is, which is still lacking. We encourage future work to design
data modalities (e.g., medical corpora, imaging, physiological methods to better evaluate, verify and benchmark LAMs. Last,
signals, etc.) can be incorporated into a single foundation model rules and regulations should be implemented to govern the de-
as a world model [186]. This world model boosts capability velopment, deployment and use of LAMs. This is a vital measure
by complementing the information missing in an input, e.g., to enforce LAMs for social good and prevent anti-social usage.
offering biomedical knowledge (acquired from other tasks) for diagnosing a disease when only symptom data is given as input. Note that this is exactly how human doctors diagnose in practice: they interpret a patient's symptoms based on the medical knowledge acquired from learning and clinical practice, i.e., from multiple other tasks and sources.

The second is to reveal the hidden capabilities of existing pre-trained LAMs. A capability is hidden if it has already been developed in a pre-trained model but remains unknown to users. Discovering hidden capabilities involves nothing more than probing the model. A typical example is the substantially improved reasoning ability of LLMs obtained by simply adding the line "Let's think step by step" to prompts [181]. There are still many unknowns about existing LAMs, as they have become increasingly complex, with enormously large sets of parameters. It is unclear whether the full potential of existing LAMs has been harnessed. It is therefore worth investigating whether pre-trained LAMs possess hidden capabilities relevant to the health informatics tasks of interest; if so, discovering them offers a low-cost route to new applications.

Overall, building responsible LAMs calls for closer collaboration in the future among academia, industry, and government.

VI. CONCLUSION

We highlight an ongoing paradigm shift within the AI community, which is fostering large AI models to transform different biomedical and health sectors. The new paradigm aims to learn a versatile foundation model on a large-scale (multi-modal) dataset covering varied data distributions and learning tasks. Boundaries between different intelligent tasks, and even between different data modalities, are being dismantled. With generalist intelligence and more unknown capabilities activated, we believe large AI models will augment, rather than replace, medical professionals and practitioners in the future. Human-AI cooperation will become pervasive. In this regard, the development of large AI models requires even closer and more intensive collaboration among domain experts, as well as gradually established regulations.
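The zero-shot chain-of-thought probing mentioned above [181] amounts to nothing more than rewriting the prompt and comparing answers. A minimal sketch of such probing, where `query_model` is a hypothetical stand-in for any LLM inference call rather than a specific API:

```python
# Zero-shot chain-of-thought probing, following Kojima et al. [181]:
# query the same pre-trained model with and without the
# "Let's think step by step" suffix and compare the responses.

COT_SUFFIX = "Let's think step by step."

def build_prompts(question: str) -> dict:
    """Return a plain prompt and a zero-shot CoT prompt for `question`."""
    return {
        "plain": question,
        "cot": f"{question}\n{COT_SUFFIX}",
    }

def probe(question: str, query_model) -> dict:
    """Probe a model for a hidden reasoning capability.

    `query_model` is any callable mapping a prompt string to the model's
    text completion (e.g., a thin wrapper around an LLM inference API).
    """
    prompts = build_prompts(question)
    return {name: query_model(p) for name, p in prompts.items()}
```

No retraining or parameter access is needed: the capability, if present, surfaces purely through the modified input.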
REFERENCES

[1] OpenAI, “ChatGPT: Optimizing language models for dialogue,” 2022. [Online]. Available: https://openai.com/blog/chatgpt/
[2] A. Kirillov et al., “Segment anything,” 2023, arXiv:2304.02643.
[3] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017.
[4] OpenAI, “GPT-4 technical report,” 2023, arXiv:2303.08774.
[5] P. Lee, C. Goldberg, and I. Kohane, The AI Revolution in Medicine: GPT-4 and Beyond. London, U.K.: Pearson Education, Limited, 2023.
[6] V. Gulshan et al., “Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” JAMA, vol. 316, no. 22, pp. 2402–2410, 2016.
[7] R. Bommasani et al., “On the opportunities and risks of foundation models,” 2021, arXiv:2108.07258.
[8] A. Radford et al., “Improving language understanding by generative pre-training,” 2018.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018, arXiv:1810.04805.
[10] H. Touvron et al., “LLaMA: Open and efficient foundation language models,” 2023, arXiv:2302.13971.
[11] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023, arXiv:2307.09288.
[12] H. W. Chung et al., “Scaling instruction-finetuned language models,” 2022, arXiv:2210.11416.
[13] A. Chowdhery et al., “PaLM: Scaling language modeling with pathways,” 2022, arXiv:2204.02311.
[14] J. Hoffmann et al., “Training compute-optimal large language models,” 2022, arXiv:2203.15556.
[15] S. Smith et al., “Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model,” 2022, arXiv:2201.11990.
[16] T. L. Scao et al., “BLOOM: A 176B-parameter open-access multilingual language model,” 2022, arXiv:2211.05100.
[17] R. Thoppilan et al., “LaMDA: Language models for dialog applications,” 2022, arXiv:2201.08239.
[18] S. Zhang et al., “OPT: Open pre-trained transformer language models,” 2022, arXiv:2205.01068.
[19] L. Ouyang et al., “Training language models to follow instructions with human feedback,” 2022, arXiv:2203.02155.
[20] J. W. Rae et al., “Scaling language models: Methods, analysis & insights from training Gopher,” 2021, arXiv:2112.11446.
[21] V. Sanh et al., “Multitask prompted training enables zero-shot task generalization,” 2021, arXiv:2110.08207.
[22] J. Wei et al., “Chain of thought prompting elicits reasoning in large language models,” 2022, arXiv:2201.11903.
[23] S. Borgeaud et al., “Improving language models by retrieving from trillions of tokens,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 2206–2240.
[24] P. F. Christiano et al., “Deep reinforcement learning from human preferences,” in Proc. Adv. Neural Inf. Process. Syst., 2017.
[25] J. Schulman et al., “Proximal policy optimization algorithms,” 2017, arXiv:1707.06347.
[26] A. Glaese et al., “Improving alignment of dialogue agents via targeted human judgements,” 2022, arXiv:2209.14375.
[27] A. S. Pinto et al., “Tuning computer vision models with task rewards,” 2023, arXiv:2302.08242.
[28] J. Yosinski et al., “How transferable are features in deep neural networks?,” in Proc. Adv. Neural Inf. Process. Syst., 2014.
[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[30] T. Ridnik et al., “ImageNet-21K pretraining for the masses,” in Proc. Adv. Neural Inf. Process. Syst., 2021.
[31] C. Sun et al., “Revisiting unreasonable effectiveness of data in deep learning era,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 843–852.
[32] X. Zhai et al., “Scaling vision transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12104–12113.
[33] D. Mahajan et al., “Exploring the limits of weakly supervised pretraining,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 181–196.
[34] M. Singh et al., “Revisiting weakly supervised pre-training of visual perception models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 804–814.
[35] M. Chen et al., “Generative pretraining from pixels,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 1691–1703.
[36] M. Assran et al., “Self-supervised learning from images with a joint-embedding predictive architecture,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 15619–15629.
[37] H. Chen et al., “Pre-trained image processing transformer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12299–12310.
[38] K. He et al., “Masked autoencoders are scalable vision learners,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 16000–16009.
[39] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 1597–1607.
[40] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Representations, 2021.
[41] K. Han et al., “Transformer in transformer,” in Proc. Adv. Neural Inf. Process. Syst., 2021, vol. 34, pp. 15908–15919.
[42] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 10012–10022.
[43] Z. Liu et al., “Swin transformer V2: Scaling up capacity and resolution,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2022, pp. 12009–12019.
[44] M. Dehghani et al., “Scaling vision transformers to 22 billion parameters,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 7480–7512.
[45] Z. Liu et al., “A ConvNet for the 2020s,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11976–11986.
[46] W. Wang et al., “InternImage: Exploring large-scale vision foundation models with deformable convolutions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 14408–14419.
[47] Z. Dai, H. Liu, Q. V. Le, and M. Tan, “CoAtNet: Marrying convolution and attention for all data sizes,” in Proc. Adv. Neural Inf. Process. Syst., 2021, vol. 34, pp. 3965–3977.
[48] S. d’Ascoli et al., “ConViT: Improving vision transformers with soft convolutional inductive biases,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 2286–2296.
[49] A. Kolesnikov et al., “Big transfer (BiT): General visual representation learning,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 491–507.
[50] M. Tan et al., “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105–6114.
[51] M. Tan et al., “EfficientNetV2: Smaller models and faster training,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 10096–10106.
[52] P. Goyal et al., “Self-supervised pretraining of visual features in the wild,” 2021, arXiv:2103.01988.
[53] Y. Huang et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in Proc. Adv. Neural Inf. Process. Syst., 2019, vol. 32.
[54] A. Radford et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[55] C. Jia et al., “Scaling up visual and vision-language representation learning with noisy text supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 4904–4916.
[56] L. Yuan et al., “Florence: A new foundation model for computer vision,” Nov. 2021, arXiv:2111.11432.
[57] C. Schuhmann et al., “LAION-5B: An open large-scale dataset for training next generation image-text models,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 25278–25294.
[58] A. Singh et al., “FLAVA: A foundational language and vision alignment model,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 15638–15650.
[59] Z. Wang et al., “SimVLM: Simple visual language model pretraining with weak supervision,” in Proc. Int. Conf. Learn. Representations, 2022.
[60] X. Chen et al., “PaLI: A jointly-scaled multilingual language-image model,” 2022, arXiv:2209.06794.
[61] J. Li et al., “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” Jan. 2023, arXiv:2301.12597.
[62] S. Huang et al., “Language is not all you need: Aligning perception with language models,” Mar. 2023, arXiv:2302.14045.
[63] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proc. Int. Conf. Mach. Learn., Jun. 2022, pp. 12888–12900.
[64] H. Pham et al., “Combined scaling for open-vocabulary image classification,” 2021, arXiv:-2111.
[65] A. Ramesh et al., “Zero-shot text-to-image generation,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8821–8831.
[66] J. Yu et al., “Scaling autoregressive models for content-rich text-to-image generation,” Trans. Mach. Learn. Res., 2022.
[67] A. Ramesh et al., “Hierarchical text-conditional image generation with CLIP latents,” 2022, arXiv:2204.06125.
[68] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10684–10695.
[69] A. Q. Nichol et al., “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 16784–16804.
[70] C. Saharia et al., “Photorealistic text-to-image diffusion models with deep language understanding,” 2022, arXiv:2205.11487.
[71] J. Sohl-Dickstein et al., “Deep unsupervised learning using nonequilibrium thermodynamics,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 2256–2265.
[72] C. Wu et al., “Visual ChatGPT: Talking, drawing and editing with visual foundation models,” 2023, arXiv:2303.04671.
[73] R. Girdhar et al., “ImageBind: One embedding space to bind them all,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 15180–15190.
[74] Y. Zhang et al., “Meta-transformer: A unified framework for multimodal learning,” 2023, arXiv:2307.10802.
[75] X.-C. Bai et al., “How cryo-EM is revolutionizing structural biology,” Trends Biochem. Sci., vol. 40, no. 1, pp. 49–57, 2015.
[76] K. Wüthrich, “The way to NMR structures of proteins,” Nature Struct. Biol., vol. 8, no. 11, pp. 923–925, 2001.
[77] J. M. Grimes et al., “Where is crystallography going?,” Acta Crystallographica Sect. D: Struct. Biol., vol. 74, no. 2, pp. 152–166, 2018.
[78] J. Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021.
[79] R. Evans et al., “Protein complex prediction with AlphaFold-Multimer,” BioRxiv, 2021.
[80] A. Madani et al., “ProGen: Language modeling for protein generation,” 2020, arXiv:2004.03497.
[81] A. Elnaggar et al., “ProtTrans: Towards cracking the language of life’s code through self-supervised deep learning and high performance computing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 7112–7127, Sep. 2021.
[82] M. Steinegger and J. Söding, “Clustering huge protein sequence sets in linear time,” Nature Commun., vol. 9, no. 1, 2018, Art. no. 2542.
[83] B. E. Suzek et al., “UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches,” Bioinformatics, vol. 31, no. 6, pp. 926–932, 2015.
[84] Z. Lin et al., “Evolutionary-scale prediction of atomic-level protein structure with a language model,” Science, vol. 379, no. 6637, pp. 1123–1130, 2023.
[85] R. Wu et al., “High-resolution de novo structure prediction from primary sequence,” BioRxiv, 2022.
[86] M. Baek et al., “Accurate prediction of protein structures and interactions using a three-track neural network,” Science, vol. 373, no. 6557, pp. 871–876, 2021.
[87] B. Chen et al., “xTrimoPGLM: Unified 100B-scale pre-trained transformer for deciphering the language of protein,” BioRxiv, 2023.
[88] A. Elnaggar et al., “Ankh: Optimized protein language model unlocks general-purpose modelling,” BioRxiv, 2023.
[89] J. Chen et al., “Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions,” BioRxiv, 2022.
[90] “RNAcentral 2021: Secondary structure integration, improved sequence search and new member databases,” Nucleic Acids Res., vol. 49, no. D1, pp. D212–D220, 2021.
[91] L. Fu, Y. Cao, J. Wu, Q. Peng, Q. Nie, and X. Xie, “UFold: Fast and accurate RNA secondary structure prediction with deep learning,” Nucleic Acids Res., vol. 50, no. 3, pp. e14–e14, 2022.
[92] T. Shen et al., “E2Efold-3D: End-to-end deep learning method for accurate de novo RNA 3D structure prediction,” 2022, arXiv:2207.01586.
[93] G. R. Buel and K. J. Walters, “Can AlphaFold2 predict the impact of missense mutations on structure?,” Nature Struct. Mol. Biol., vol. 29, no. 1, pp. 1–2, 2022.
[94] E. Tiu et al., “Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning,” Nature Biomed. Eng., vol. 6, no. 12, pp. 1399–1406, 2022.
[95] S. Wang et al., “ChatCAD: Interactive computer-aided diagnosis on medical image using large language models,” 2023, arXiv:2302.07257.
[96] Z. Zhao et al., “ChatCAD+: Towards a universal and reliable interactive CAD using LLMs,” 2023, arXiv:2305.15964.
[97] Y. Li et al., “ChatDoctor: A medical chat model fine-tuned on a large language model Meta-AI (LLaMA) using medical domain knowledge,” Cureus, vol. 15, no. 6, 2023.
[98] O. Thawkar et al., “XrayGPT: Chest radiographs summarization using medical vision-language models,” 2023, arXiv:2306.07971.
[99] H. Wang et al., “HuaTuo: Tuning LLaMA model with Chinese medical knowledge,” 2023, arXiv:2304.06975.
[100] A. Vaid et al., “A foundational vision transformer improves diagnostic performance for electrocardiograms,” NPJ Digit. Med., vol. 6, no. 1, 2023, Art. no. 108.
[101] Y. Li et al., “BEHRT: Transformer for electronic health records,” Sci. Rep., vol. 10, no. 1, pp. 1–12, 2020.
[102] L. Rasmy et al., “Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction,” NPJ Digit. Med., vol. 4, no. 1, 2021, Art. no. 86.
[103] P. Shi et al., “Generalist vision foundation models for medical imaging: A case study of segment anything model on zero-shot medical segmentation,” Diagnostics, vol. 13, no. 11, 2023, Art. no. 1947.
[104] J. Wu et al., “Medical SAM adapter: Adapting segment anything model for medical image segmentation,” 2023, arXiv:2304.12620.
[105] Z. Wang, Z. Wu, D. Agarwal, and J. Sun, “MedCLIP: Contrastive learning from unpaired medical images and text,” 2022, arXiv:2210.10163.
[106] Z. Huang et al., “A visual–language foundation model for pathology image analysis using medical Twitter,” Nature Med., vol. 29, pp. 2307–2316, 2023.
[107] S. Chen, K. Ma, and Y. Zheng, “Med3D: Transfer learning for 3D medical image analysis,” 2019, arXiv:1904.00625.
[108] P. Chambon et al., “Adapting pretrained vision-language foundational models to medical imaging domains,” 2022, arXiv:2210.04133.
[109] Z. Huang et al., “STU-Net: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training,” 2023, arXiv:2304.06716.
[110] T. Brown et al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., 2020, vol. 33, pp. 1877–1901.
[111] “PubMed abstract,” 2023. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/download/
[112] “PubMed Central,” 2023. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/
[113] J. Lee et al., “BioBERT: A pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020.
[114] E. Alsentzer et al., “Publicly available clinical BERT embeddings,” 2019, arXiv:1904.03323.
[115] H.-C. Shin et al., “BioMegatron: Larger biomedical domain language model,” 2020, arXiv:2010.06060.
[116] S. Gururangan et al., “Don’t stop pretraining: Adapt language models to domains and tasks,” 2020, arXiv:2004.10964.
[117] K. R. Kanakarajan, B. Kundumani, and M. Sankarasubbu, “BioELECTRA: Pretrained biomedical text encoder using discriminators,” in Proc. 20th Workshop Biomed. Lang. Process., 2021, pp. 143–154.
[118] Y. Gu et al., “Domain-specific language model pretraining for biomedical natural language processing,” Health, vol. 3, no. 1, pp. 1–23, 2021.
[119] M. Yasunaga, J. Leskovec, and P. Liang, “LinkBERT: Pretraining language models with document links,” 2022, arXiv:2203.15827.
[120] R. Luo et al., “BioGPT: Generative pre-trained transformer for biomedical text generation and mining,” Brief. Bioinf., vol. 23, no. 6, 2022, Art. no. bbac409.
[121] K. Singhal et al., “Large language models encode clinical knowledge,” 2022, arXiv:2212.13138.
[122] X. Yang et al., “A large language model for electronic health records,” NPJ Digit. Med., vol. 5, no. 1, 2022, Art. no. 194.
[123] E. J. Hu et al., “LoRA: Low-rank adaptation of large language models,” 2021, arXiv:2106.09685.
[124] T. Han et al., “MedAlpaca–An open-source collection of medical conversational AI models and training data,” 2023, arXiv:2304.08247.
[125] J. Wei et al., “Emergent abilities of large language models,” 2022, arXiv:2206.07682.
[126] M. Agrawal, S. Hegselmann, H. Lang, Y. Kim, and D. Sontag, “Large language models are few-shot clinical information extractors,” 2022, arXiv:2205.12689.
[127] K. Singhal et al., “Towards expert-level medical question answering with large language models,” 2023, arXiv:2305.09617.
[128] V. Liévin, C. E. Hother, and O. Winther, “Can large language models reason about medical questions?,” 2022, arXiv:2207.08143.
[129] S. B. Patel and K. Lam, “ChatGPT: The future of discharge summaries?,” Lancet Digit. Health, vol. 5, no. 3, pp. e107–e108, 2023.
[130] “Fairway health - process prior authorization faster,” 2023. [Online]. Available: https://www.ycombinator.com/launches/IIu-fairway-health-process-prior-authorization-faster
[131] T. H. Kung et al., “Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models,” PLoS Digit. Health, vol. 2, no. 2, 2023, Art. no. e0000198.
[132] E. Shue, L. Liu, B. Li, Z. Feng, X. Li, and G. Hu, “Empowering beginners in bioinformatics with ChatGPT,” BioRxiv, 2023.
[133] H. Dai et al., “ChatAug: Leveraging ChatGPT for text data augmentation,” 2023, arXiv:2302.13007.
[134] E. Mitchell et al., “DetectGPT: Zero-shot machine-generated text detection using probability curvature,” 2023, arXiv:2301.11305.
[135] X. Lin et al., “PanGu drug model: Learn a molecule like a human,” BioRxiv, 2022.
[136] D. M. Korngiebel and S. D. Mooney, “Considering the possibilities and pitfalls of generative pre-trained transformer 3 (GPT-3) in healthcare delivery,” NPJ Digit. Med., vol. 4, no. 1, 2021, Art. no. 93.
[137] E. Chen et al., “Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set,” JMIR Public Health Surveill., vol. 6, no. 2, 2020, Art. no. e19273.
[138] J. Peng et al., “Clustering egocentric images in passive dietary monitoring with self-supervised learning,” in Proc. IEEE-EMBS Int. Conf. Biomed. Health Inform., 2022, pp. 1–4.
[139] J. Qiu et al., “Egocentric image captioning for privacy-preserved passive dietary intake monitoring,” IEEE Trans. Cybern., early access, doi: 10.1109/TCYB.2023.3243999.
[140] B. M. Popkin, C. Corvalan, and L. M. Grummer-Strawn, “Dynamics of the double burden of malnutrition and the changing nutrition reality,” Lancet, vol. 395, no. 10217, pp. 65–74, 2019.
[141] T. Nguyen et al., “ClimaX: A foundation model for weather and climate,” 2023, arXiv:2301.10343.
[142] K. Bi et al., “Accurate medium-range global weather forecasting with 3D neural networks,” Nature, vol. 619, no. 7970, pp. 533–538, 2023.
[143] Z. Wang et al., “Foundation model for endoscopy video analysis via large-scale self-supervised pre-train,” 2023, arXiv:2306.16741.
[144] G. D’Onofrio et al., “Emotion recognizing by a robotic solution initiative (EMOTIVE project),” Sensors, vol. 22, no. 8, 2022, Art. no. 2861.
[145] J. Qiu et al., “Egocentric human trajectory forecasting with a wearable camera and multi-modal fusion,” IEEE Robot. Automat. Lett., vol. 7, no. 4, pp. 8799–8806, Oct. 2022.
[146] P. Asgharian, A. M. Panchea, and F. Ferland, “A review on the use of mobile service robots in elderly care,” Robotics, vol. 11, no. 6, 2022, Art. no. 127.
[147] L. Seenivasan et al., “SurgicalGPT: End-to-end language-vision GPT for visual question answering in surgery,” 2023, arXiv:2304.09974.
[148] S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “ChatGPT for robotics: Design principles and model abilities,” Microsoft, Redmond, WA, USA, Tech. Rep. MSR-TR-2023-8, Feb. 2023. [Online]. Available: https://www.microsoft.com/en-us/research/publication/chatgpt-for-robotics-design-principles-and-model-abilities/
[149] S. Reed et al., “A generalist agent,” Trans. Mach. Learn. Res., 2022.
[150] M. Shridhar et al., “CLIPort: What and where pathways for robotic manipulation,” in Proc. Conf. Robot Learn., 2022, pp. 894–906.
[151] M. Shridhar et al., “Perceiver-Actor: A multi-task transformer for robotic manipulation,” 2022, arXiv:2209.05451.
[152] M. Ahn et al., “Do as I can and not as I say: Grounding language in robotic affordances,” 2022, arXiv:2204.01691.
[153] D. Driess et al., “PaLM-E: An embodied multimodal language model,” 2023, arXiv:2303.03378.
[154] Y. Jiang et al., “VIMA: General robot manipulation with multimodal prompts,” 2022, arXiv:2210.03094.
[155] A. Brohan et al., “RT-1: Robotics transformer for real-world control at scale,” 2022, arXiv:2212.06817.
[156] N. Ding et al., “Parameter-efficient fine-tuning of large-scale pre-trained language models,” Nature Mach. Intell., vol. 5, no. 3, pp. 220–235, 2023.
[157] S. Gilbert et al., “Large language model AI chatbots require approval as medical devices,” Nature Med., vol. 29, pp. 2396–2398, 2023.
[158] X. Shen, Z. Chen, M. Backes, and Y. Zhang, “In ChatGPT we trust? Measuring and characterizing the reliability of ChatGPT,” 2023, arXiv:2304.08979.
[159] P. Lee, S. Bubeck, and J. Petro, “Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine,” New England J. Med., vol. 388, no. 13, pp. 1233–1239, 2023.
[160] H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, “Capabilities of GPT-4 on medical challenge problems,” 2023, arXiv:2303.13375.
[161] J. Wang et al., “On the robustness of ChatGPT: An adversarial and out-of-distribution perspective,” 2023, arXiv:2302.12095.
[162] L. Li and M. W. Spratling, “Data augmentation alone can improve adversarial training,” in Proc. Int. Conf. Learn. Representations, 2023.
[163] N. Carlini et al., “Extracting training data from large language models,” in Proc. USENIX Secur. Symp., 2021, vol. 6, pp. 2633–2650.
[164] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying memorization across neural language models,” 2022, arXiv:2202.07646.
[165] H. Li, D. Guo, W. Fan, M. Xu, and Y. Song, “Multi-step jailbreaking privacy attacks on ChatGPT,” 2023, arXiv:2304.05197.
[166] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in Proc. IEEE Symp. Secur. Privacy, 2017, pp. 3–18.
[167] J. Duan, F. Kong, S. Wang, X. Shi, and K. Xu, “Are diffusion models vulnerable to membership inference attacks?,” 2023, arXiv:2302.01316.
[168] “How your data is used to improve model performance,” 2023. [Online]. Available: https://help.openai.com/en/articles/5722486-how-your-data-is-used-to-improve-model-performance
[169] “March 20 ChatGPT outage: Here’s what happened,” 2023. [Online]. Available: https://openai.com/blog/march-20-chatgpt-outage
[170] K. Greshake et al., “More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models,” 2023, arXiv:2302.12173.
[171] W. J. Hall et al., “Implicit racial/ethnic bias among health care professionals and its influence on health care outcomes: A systematic review,” Amer. J. Public Health, vol. 105, no. 12, pp. e60–e76, 2015.
[172] Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan, “Dissecting racial bias in an algorithm used to manage the health of populations,” Science, vol. 366, no. 6464, pp. 447–453, 2019.
[173] D. Cirillo et al., “Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare,” NPJ Digit. Med., vol. 3, no. 1, 2020, Art. no. 81.
[174] D. S. Char, N. H. Shah, and D. Magnus, “Implementing machine learning in health care–addressing ethical challenges,” New England J. Med., vol. 378, no. 11, 2018, Art. no. 981.
[175] J. Rutinowski, S. Franke, J. Endendyk, I. Dormuth, and M. Pauly, “The self-perception and political biases of ChatGPT,” 2023, arXiv:2304.07333.
[176] T. Y. Zhuo et al., “Exploring AI ethics of ChatGPT: A diagnostic analysis,” 2023, arXiv:2301.12867.
[177] D. Hendrycks et al., “Aligning AI with shared human values,” 2020, arXiv:2008.02275.
[178] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “RealToxicityPrompts: Evaluating neural toxic degeneration in language models,” 2020, arXiv:2009.11462.
[179] G. Daras and A. G. Dimakis, “Discovering the hidden vocabulary of DALLE-2,” 2022, arXiv:2206.00169.
[180] Y. Wang, P. Shi, and H. Zhang, “Investigating the existence of “secret language” in language models,” 2023, arXiv:2307.12507.
[181] T. Kojima et al., “Large language models are zero-shot reasoners,” in Proc. Adv. Neural Inf. Process. Syst., 2022, vol. 35, pp. 22199–22213.
[182] N. Elhage et al., “A mathematical framework for transformer circuits,” Transformer Circuits Thread, vol. 1, 2021.
[183] D. Patterson et al., “Carbon emissions and large neural network training,” 2021, arXiv:2104.10350.
[184] D. Patterson et al., “The carbon footprint of machine learning training will plateau, then shrink,” Computer, vol. 55, no. 7, pp. 18–28, Jul. 2022.
[185] E. F. Villaronga, P. Kieseberg, and T. Li, “Humans forget, machines remember: Artificial intelligence and the right to be forgotten,” Comput. Law Secur. Rev., vol. 34, no. 2, pp. 304–313, 2018.
[186] Y. LeCun, “A path towards autonomous machine intelligence version 0.9.2, 2022-06-27,” Open Rev., vol. 62, 2022.
[187] P. Liu et al., “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Comput. Surv., vol. 55, no. 9, pp. 1–35, 2023.
[188] D. Leslie, “Understanding artificial intelligence ethics and safety,” 2019, arXiv:1906.05684.
[189] S. A. Siddiqui et al., “Metadata archaeology: Unearthing data subsets by leveraging training dynamics,” in Proc. Int. Conf. Learn. Representations, 2022.
[190] X. Huang et al., “A survey of safety and trustworthiness of large language models through the lens of verification and validation,” 2023, arXiv:2305.11391.