
Article — https://doi.org/10.1038/s41467-023-42451-8

Intelligent surgical workflow recognition for endoscopic submucosal dissection with real-time animal study

Received: 20 January 2023
Accepted: 11 October 2023

Jianfeng Cao1, Hon-Chi Yip2, Yueyao Chen1, Markus Scheppach3, Xiaobei Luo4, Hongzheng Yang1, Ming Kit Cheng5, Yonghao Long1, Yueming Jin6, Philip Wai-Yan Chiu7, Yeung Yam5,7,8, Helen Mei-Ling Meng8 & Qi Dou1

1Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China. 2Department of Surgery, The Chinese University of Hong Kong, Hong Kong, China. 3Internal Medicine III-Gastroenterology, University Hospital of Augsburg, Augsburg, Germany. 4Guangdong Provincial Key Laboratory of Gastroenterology, Nanfang Hospital, Southern Medical University, Guangzhou, China. 5Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Hong Kong, China. 6Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore. 7Multi-scale Medical Robotics Center and The Chinese University of Hong Kong, Hong Kong, China. 8Centre for Perceptual and Interactive Intelligence and The Chinese University of Hong Kong, Hong Kong, China. e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

Recent advancements in artificial intelligence have witnessed human-level performance; however, AI-enabled cognitive assistance for therapeutic procedures has not been fully explored nor pre-clinically validated. Here we propose AI-Endo, an intelligent surgical workflow recognition suite for endoscopic submucosal dissection (ESD). Our AI-Endo is trained on high-quality ESD cases from an expert endoscopist, spanning over a decade and consisting of 201,026 labeled frames. The learned model demonstrates outstanding performance on validation data, including cases from relatively junior endoscopists with various skill levels, procedures conducted with different endoscopy systems and therapeutic skills, and cohorts from international multi-centers. Furthermore, we integrate our AI-Endo with the Olympus endoscopic system and validate the AI-enabled cognitive assistance system with animal studies in live ESD training sessions. Dedicated data analysis from surgical phase recognition results is summarized in an automatically generated report for skill assessment.

AI-enabled video data analytics is promising to provide cognitive assistance for various clinical needs in minimally invasive surgery1. Analyzing the progress of surgical workflow, i.e., recognizing which surgical step/phase is ongoing at each second, is important for the standardization and support of surgical care2,3. For example, in endoscopic submucosal dissection (ESD), a therapeutic approach to resect early-stage gastrointestinal (GI) cancer4,5, the smoothness and proficiency of its dissection phase can exhibit a surgeon's skill6–8. Using AI to accomplish such analytical assessment has the potential to promote more efficient and standardized surgical operations9; however, relevant research is still in its infancy.

With advances in computer-assisted surgery in clinical practice10–12, intelligent surgical workflow analysis has attracted increasing attention from computer scientists and surgeons. Although promising progress has been made13, the way to automated surgical data analysis is still encumbered by technical challenges. A core unsolved dilemma is to balance the accuracy and efficiency of AI prediction models.
On the one hand, accurate surgical workflow recognition relies on the consideration of rich temporal information in the video, because temporal context awareness is critical for understanding sequential actions. This requires AI models to extract long-range features from a sequence of frames14. Existing methods, such as 3D CNNs15,16 and temporal convolution networks17,18, still struggle to effectively capture global temporal information given the expansive surgical duration. On the other hand, recognized surgical phases need to be predicted in real time, in order to fulfill intraoperative deployment in surgery. It is challenging to achieve such high efficiency without compressing model parameters and sacrificing model performance. Although some representative works, such as TMRNet19 and Trans-SVNet20, achieved promising results with versatile models, their dependence on considerable computational resources constrains their potential for clinical application. To date, how to effectively address this dilemma for successfully deploying AI models in the operating room is still an open question.

Playing a pivotal role in maintaining high accuracy of phase recognition, dataset quality drives the learning process of AI models with representative samples and universal features. Different from the principles in traditional video-based action recognition21–23, expert knowledge could impact the modeling of operational patterns in ESD surgery24, thus determining the applicability of the AI model to various cases at the stage of clinical deployment. Therefore, developing surgical AI models has a greater need for establishing an expert dataset that covers the changes in anatomic targets, surgical tools, and how a tool is manipulated by surgeons25. Standardization and expertise of the dataset can not only provide typical samples that commonly occur in ESD therapeutic procedures26 but also facilitate future downstream analysis based on the recognition results27. The construction of such a dataset, however, remains incomplete due to the scarcity of experts as well as annotation protocols28.

Although surgical data science has been studied for a while29, experimental validation of deep learning models in real-world complex scenarios and/or real-time pre-clinical settings is still extremely limited. Existing literature still lacks systematic experiments on how the developed AI models are validated given various surgeon expertise (e.g., from novices to experienced ones), long-time data expansion (i.e., surgical instruments change over time) and across surgical sites (from retrospective human data to ex vivo/in vivo animal trials). All these factors would introduce data distribution shifts and are important to consider experimentally because they may degrade the generalizability of data-driven models. In addition, how to incorporate such automated data analysis in a way that fits into clinical workflow and fulfills clinical needs is non-trivial and unclear. In these regards, systematic experiments, even live animal studies, are necessary to experimentally verify the effectiveness of AI models for real-world clinical applications. Some works have explored possibilities to incorporate intelligent functions in applications of procedural skill assessment30,31 and future frame prediction32 through in-silico experiments; however, these works were limited to using surgical data analytics in an offline mode, rarely considering the efficiency of burdensome models in practice. For the advancement of the clinical value of AI models, experimental results in real-world settings are frequently suggested in smart healthcare-related guidelines. To date, there is no reported work on validating AI models in live animal pre-clinical settings for ESD.

In this study, we proposed a deep learning-based method (named AI-Endo) for intelligent surgical workflow recognition in ESD. To achieve accurate phase recognition and real-time clinical deployment, we introduced a cascade of feature extraction and fusion modules with the ability of spatial-temporal reasoning. As the endoscopy video streams into the framework, it can not only extract representative frame-wise features but also distill temporal relations to describe complicated surgical scenes. Furthermore, we designed the framework with a light yet compelling feature backbone and dynamic feature fusion to accommodate the trade-off regarding test efficiency. Importantly, our model learned from high-quality data collected from an expert endoscopist (with ESD experience of over 15 years), accompanied by a clearly defined surgical phase definition and exhaustive frame-wise annotation. To experimentally evaluate the performance of AI-Endo in the wild, we extensively tested it on external datasets including different endoscopists, various surgical tools and skills, different endoscopy systems, and multi-center cohorts. Moreover, we studied the potential usage of surgical phase recognition by integrating AI-Endo into surgical skill training sessions at CUHK Jockey Club Minimally Invasive Surgical Skills Centre. To evaluate the computational efficiency and compatibility of AI-Endo in real-time applications, we conducted a cost-effective ex vivo animal trial using video streamed from an endoscopy system to our AI workstation. Thereafter, we designed an in vivo animal study to showcase the potential of AI-Endo in a standard clinical setup. A user-friendly interface was developed that could visualize real-time recognition of surgical workflow and automatically generate a summary report for data analysis toward surgical skill assessment. This study sheds light on automated surgical workflow recognition with validation in real-time pre-clinical settings for ESD.

Results
Developmental dataset for model training
Forty-seven endoscopy videos with full-length ESD procedures (duration 71.28 ± 36.71 min) recorded from the Endoscopy Centre of Prince of Wales Hospital in Hong Kong were used as the training cohort. All cases were performed by an expert who has over a decade of experience in ESD. Expert procedure videos were chosen as training material because AI models treat the dataset as a gold standard, and the demonstrated endoscopic and device maneuvering skills should represent expertise for operations in safety-critical situations. The dataset covered a long period from July 2008 to March 2020. The videos were recorded using endoscopy video processors (CV-260 and CV-290, Olympus Medical Corporations, Tokyo, Japan), with a resolution of 352 × 240 or 720 × 576 at 25 fps and a resolution of 1920 × 1080 at 50 fps (i.e., frames per second). This yielded up to a 3 GB file size for each single case and millions of frames in total for the overall dataset. All patients' sensitive information including ID, sex, and age was de-identified, and patient consent forms were waived for the retrospective cohort. IRB approval has been granted by the ethics committee of The Chinese University of Hong Kong.

The included cases cover a wide variability of lesion sizes, locations (i.e., rectum, stomach and esophagus) and surgical tools (i.e., dual/insulation-tipped/triangle-tipped knife). More details about the variability of the dataset are provided in Supplementary Table 1. Although the dataset spans a long time of 12 years, the endoscopist had already achieved the level of expertise for the whole period. At the start point (year 2008) of the cohort duration, the endoscopist had conducted more than 100 ESD cases on each organ of rectum, stomach, and esophagus. According to the learning curves reported in refs. 33–35, the endoscopist can be treated as an expert because the number of conducted cases is higher than the suggested bar (80/30/30 cases for rectum/stomach/esophagus, respectively). Annotation was performed on all of the retrospective datasets for surgical phase recognition.




Fig. 1a defines the annotation protocol for the four surgical phases (image examples are shown in the original figure):

Marking — Surgical knives, including the triangle-tipped knife, insulation-tipped knife and dual knife, get close to the tumor and the knife is placed onto the surface of the mucosa; coagulation marks are labelled around 2 mm apart and 5 mm vertical to the tumor. Start frame: the surgical knife is placed on the surface of the mucosa. End frame: the surgical knife is retrieved or not in contact with the mucosa for more than 3 consecutive frames.

Injection — The injection needle is inserted into the working channel, and the tip of the tool is exposed and inserted into the submucosal layer; after sufficient solutions, e.g., hyaluronic acid and dextrose water, are injected from the probe into the mucosa, the injection tool is retracted; the tumor is lifted for better observation. Start frame: the injection needle is inserted into the submucosal layer. End frame: the injection needle is retrieved from the submucosal layer or not in contact with the submucosal layer for more than 3 consecutive frames.

Dissection — The surgeon sets the electrosurgical unit (ESU) and makes sure the cutting surface of the knife (e.g., triangle-tipped knife, insulation-tipped knife or dual knife) is exposed; the surgeon controls the knife to cut around the circumference of the tumor, during which injection might be repeated to keep enough elevation; after the dissection is completed, the lesion is removed with graspers. Start frame: the knife is used to cut the circumference of the tumor. End frame: the knife stops cutting or is not in contact with the soft tissue for more than 3 consecutive frames.

Idle — The hold-on time for the endoscopist to change the surgical tools, adjust the angle of the scope and clean the field of view, as well as the time used by the surgeon for inspecting the status of the procedure and making decisions on following steps. Start frame: the surgical tool has been retrieved or hovered for more than 3 consecutive frames. End frame: any one of the phases Marking, Injection, or Dissection begins.

Fig. 1c reports the numbers of annotated frames:

ESD surgical phases | Marking | Injection | Dissection | Idle | Total labelled frames
Developmental training data | 4,679 | 14,454 | 111,836 | 70,057 | 201,026
External validation data | 3,271 | 7,000 | 65,195 | 91,061 | 166,527

Fig. 1 | Establishment of developmental and external datasets for ESD surgical phase recognition. a Illustration and definition of the four surgical phases: Marking, Injection, Dissection and Idle. Examples of start and end frames are provided in Supplementary Fig. 2. b Phase annotation examples of surgical videos. c The statistical numbers of four annotated frames in our developmental training data and external validation data, and the corresponding violin plot on the distribution of annotated frames. The box indicates the median as a white point in the box and excludes the upper and lower 25% (quartiles) of data, and the whiskers extend to the extrema (Developmental data n = 47 cases; External data n = 15 cases). Source data are provided as a Source Data file.
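The protocol above specifies each phase by its start and end frames, with uncovered time defaulting to Idle. The following minimal sketch illustrates how such interval annotations can be expanded into the per-frame labels used for training at 1 fps; the (phase, start, end) tuple format and the helper name are illustrative assumptions, not the authors' released tooling.

```python
# Minimal sketch: expand (phase, start_sec, end_sec) interval annotations into
# one label per second of video at 1 fps. The tuple format is an assumed,
# illustrative storage layout; frames outside Marking/Injection/Dissection
# fall back to Idle, mirroring the protocol's definition of Idle as hold-on time.

PHASES = ["Marking", "Injection", "Dissection", "Idle"]

def intervals_to_frame_labels(intervals, duration_sec):
    labels = ["Idle"] * duration_sec
    for phase, start, end in intervals:
        for t in range(start, min(end + 1, duration_sec)):
            labels[t] = phase
    return labels

# Toy example: marking runs from 10-40 s and injection from 55-80 s.
clip = [("Marking", 10, 40), ("Injection", 55, 80)]
frame_labels = intervals_to_frame_labels(clip, duration_sec=120)
print(frame_labels[12], frame_labels[60], frame_labels[100])  # Marking Injection Idle
```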

Annotation protocol of ESD workflow
To annotate the developmental dataset and external validation dataset, we propose a standardized ESD annotation protocol (see Fig. 1a). Four surgical phases have been defined: (1) Marking: the periphery of the target lesion would be identified, then marking would be performed by applying multiple electrocautery marks circumferentially around 5 mm away from the lesion at 2 mm intervals; (2) Injection: submucosal elevation would be achieved by injection of a mixture of solutions containing normal saline, epinephrine, or hyaluronic acid using a needle injector. Due to the difficulty in retrospective annotation and the often ultra-short duration of 1–2 s, transient saline injection through the channel within the electrosurgical knives could not be separately annotated and thus would be included in the Dissection instead of the Injection phase; (3) Dissection (mucosal incision and submucosal dissection): mucosa around the marking point is incised, and then the submucosal layer would be dissected from the underlying muscularis propria until the target lesion is resected and removed. Hemostasis with electrosurgical knives is included because of its short duration; (4) Idle: the hold-on time spent by the endoscopists to exchange the instruments or adjust the endoscope. Each single frame is only labeled with one of the four phases, which is determined based on identifying the start and end frame of each phase as well as its temporal continuity36.

For all the cases, we excluded frames after the tumor was completely resected. We also downsampled the video to 1 fps for annotation efficiency. To ensure high-quality data, the annotation workflow consisted of three stages. First, two well-trained medical annotators independently annotated approximately 10% (5 cases, 20,446 frames) of the expert dataset based on the dataflow in Supplementary Fig. 1. The inter-rater agreement was measured using the Pearson correlation coefficient (PCC)37, which was 0.93. This showed a high consistency of labeling between the two raters leveraging our provided annotation protocol. Supplementary Fig. 3 provides an example of the annotations from the two raters. Then, the two raters jointly labeled the whole dataset by dividing all the cohorts into approximately equal halves, with each rater individually annotating one part. After they completed all of their annotation tasks, the annotations underwent quality control by another two experienced endoscopists. The annotation assessment relied on not only visual cues but also practical experience to determine the surgical phase. Discussions happened in situations where the surgical site was highly complex or key landmarks were not clearly seen. Details on the dataflow, annotation schedule and annotation results are provided in Supplementary Note 1. Three final annotation examples (with different video durations) are shown in Fig. 1b. The number of annotated frames in the expert dataset varies across each phase, with the phase of Dissection occupying most of the surgical time; this is also the most important and skill-demanding phase in ESD. Detailed statistics of each phase are listed in Fig. 1c. Overall, a total of 201,026 and 166,527 frames were labeled for the developmental training and external validation (described below) datasets. All the annotations in this study followed the same annotation protocol.
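As a rough illustration of the inter-rater agreement check described above, the sketch below computes the Pearson correlation coefficient between two raters' frame-wise labels, assuming the four phases are integer-encoded (0–3); the encoding is a hypothetical choice, and the resulting coefficient depends on it.

```python
# Minimal sketch of the inter-rater agreement computation. The integer phase
# encoding (0: Marking, 1: Injection, 2: Dissection, 3: Idle) is an assumption
# for illustration; the paper does not specify how labels were encoded.
import numpy as np
from scipy.stats import pearsonr

rater_a = np.array([3, 3, 0, 0, 0, 1, 1, 2, 2, 2])  # toy frame-wise labels
rater_b = np.array([3, 0, 0, 0, 0, 1, 1, 2, 2, 3])

pcc, _ = pearsonr(rater_a, rater_b)
print(f"PCC between raters: {pcc:.2f}")
```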
External datasets for model validation
Given the complexity of anatomical scenes and the variety of procedures, it is critical to validate the applicability of the AI model to different endoscopists and operation skills. To this end, we first collected 15 cases of ESD performed in Prince of Wales Hospital in Hong Kong from April 2021 to August 2022, and 122,114 frames in total were annotated at 1 fps. These procedures were conducted by three younger endoscopists with 6, 3 and 2 years of experience in ESD, respectively. Different from the developmental dataset that concentrates on data from an expert clinician with stable and proficient surgical performance, the validation data aims to reflect the variance in surgical skills in order to evaluate the model's generalizability and its potential to support skill assessment in clinical practice such as training sessions. The variation in the endoscopists' experience in the validation dataset helped to assess the AI model's tolerance to human factors that are commonly associated with different levels of expertise. This is especially important for ESD, which is a typical procedure requiring a long learning curve38.

We also conducted further validation of our AI model in ESD procedures with unseen operation techniques, i.e., surgical skills that were not present in the developmental dataset. Three ESD cases with the pocket creation method39 and one case with the line-assisted traction method40 were acquired from Prince of Wales Hospital for this purpose. In particular, the pocket creation technique is used to improve the visualization of the dissection plane by creating a pocket in the submucosal layer after making a small mucosal incision for entry. The traction method involves using additional instruments (such as a clip with a line, snare or other commercially available traction devices) to apply counter traction.




Fig. 2 panels: a ROC curves (true positive rate vs false positive rate) of the Marking, Injection, Dissection and Idle phases; b phase-wise scores based on the Youden Index, reproduced below; c example frames of the four phases; d the confusion matrix of annotation versus prediction across the four phases.

Phase-wise metric | AUROC | Specificity | Sensitivity | Orderliness
Marking | 97.69 (94.37, 100.0) | 97.57 (95.11, 100.0) | 94.45 (88.39, 100.0) | 97.56 (95.11, 100.0)
Injection | 98.40 (96.48, 100.0) | 95.98 (92.94, 99.02) | 97.38 (96.63, 98.14) | 96.16 (93.43, 98.89)
Dissection | 97.85 (96.73, 98.97) | 93.91 (92.94, 94.89) | 95.10 (93.36, 96.85) | 94.59 (93.47, 95.72)
Idle | 96.69 (95.94, 97.44) | 93.27 (92.09, 94.44) | 91.31 (90.00, 92.62) | 92.57 (91.58, 93.56)
Average | 97.66 (95.88, 99.10) | 95.18 (93.27, 97.09) | 94.56 (92.10, 96.90) | 95.22 (93.40, 97.04)

Fig. 2 | Analysis results of 5-fold cross-validation on the developmental dataset. a The receiver operating characteristic (ROC) curve of four phases; b Statistical scores of AI-Endo on four phases based on the Youden Index (n = 47 cases). Data are presented as 95% confidence interval; c Example frames in four phases with intra-class difference and inter-class similarity; d The confusion matrix across four surgical phases. Source data are provided as a Source Data file.

This dataset overall has 19,254 annotated frames. Investigation into these techniques helps observe the performance of the AI model when encountering novel styles of tool-tissue interactions.

In addition, we designed ex vivo and in vivo animal trials for the purpose of validating the integration of the AI model into the existing endoscopic system. For initial feasibility validation, we conducted four ex vivo animal trials to streamline the data flow of AI assistance into the standard ESD workflow. Afterward, we conducted in vivo experiments for validation of the whole system in real time. A total of 12 ESD procedures were performed on two live pigs in a surgical training session.

Furthermore, we collected external multi-center datasets to validate the generalizability of AI-Endo on different endoscopy systems and demographics. The first cohort contained four ESD cases from Nanfang Hospital, Southern Medical University, Guangzhou, China. This dataset was collected from a Fujifilm endoscopic system, in order to validate applicability for different imaging devices. The second cohort contained four ESD cases from Internal Medicine III-Gastroenterology, University Hospital of Augsburg, Augsburg, Germany. This dataset was collected from an Olympus endoscopic system, which is the same as our developmental data, while the purpose of this external dataset is to validate the model's generalizability on patients from a different country. These two cohorts were labeled according to our annotation protocol and yielded 25,159 frames in total.

Performance of AI-Endo model using 5-fold cross-validation on developmental dataset
For automated ESD surgical phase recognition, we propose a deep learning-based framework called AI-Endo, which inputs the video stream and embeds each frame into a high-dimensional feature space. To sufficiently make use of temporal information for accurate model performance, we incorporate a cascade of feature extraction with a temporal convolution network and a global attention-based transformer to extract spatial-temporal features. Our AI-Endo is developed based on the 47 training cases in 5 folds (with sizes of 10, 10, 9, 9, 9), each one of which is used for performance evaluation while the other 4 folds are used for training the learning algorithm. Without loss of generality, this cross-validation strategy enables the developed framework to be validated on the whole developmental dataset.

The phase prediction can be obtained by taking the maximum or setting an optimal threshold on the output probabilities. Both overall and phase-wise metrics can be derived from four collections, i.e., true positive (TP), true negative (TN), false positive (FP) and false negative (FN). For the overall performance, we adopt three commonly used criteria, i.e., average accuracy, average precision and average recall. The average accuracy, $(TP+TN)/(TP+TN+FP+FN)$, captures the overall ratio of correctly classified frames. The average precision, $TP/(TP+FP)$, and recall, $TP/(TP+FN)$, deliver the fraction of relevant samples in all retrieved samples and the completeness of the relevant collection. Moreover, to inspect the performance of AI-Endo on each phase, we plot the receiver operating characteristic curve (ROC) and evaluate the results of the AI inference with the area under the ROC (AUROC)41. Meanwhile, we refer to a summary measurement of the ROC curve, the Youden Index42,43, to apply the optimal threshold for phase prediction, yielding a set of $\hat{TP}$, $\hat{TN}$, $\hat{FP}$ and $\hat{FN}$ for each phase, which are used to calculate the specificity $\hat{TN}/(\hat{TN}+\hat{FP})$ and sensitivity $\hat{TP}/(\hat{TP}+\hat{FN})$ of each phase, consistent with the Youden Index and ROC curve. Moreover, we define the orderliness metric $(\hat{TP}+\hat{TN})/(\hat{TP}+\hat{TN}+\hat{FP}+\hat{FN})$ for phase-wise evaluation to measure the degree to which the target frames are correctly ordered for each phase. Details on this metric are provided in Supplementary Note 2.
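A minimal sketch of this phase-wise evaluation is given below, assuming hypothetical arrays `probs` (per-frame softmax outputs, n_frames × 4) and `labels` (integer ground truth). For each phase it computes the one-vs-rest AUROC, selects the operating threshold that maximizes the Youden Index J = sensitivity + specificity − 1, and reports the resulting sensitivity and specificity.

```python
# Minimal sketch of per-phase AUROC plus Youden-Index thresholding.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def phase_wise_scores(probs, labels, phase_idx):
    y_true = (labels == phase_idx).astype(int)   # one-vs-rest ground truth
    y_score = probs[:, phase_idx]
    auroc = roc_auc_score(y_true, y_score)
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = np.argmax(tpr - fpr)                  # Youden Index J = TPR - FPR
    y_pred = (y_score >= thresholds[best]).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return auroc, tp / (tp + fn), tn / (tn + fp)  # AUROC, sensitivity, specificity

rng = np.random.default_rng(0)                    # toy data for a dry run
probs = rng.dirichlet(np.ones(4), size=1000)
labels = rng.integers(0, 4, size=1000)
print(phase_wise_scores(probs, labels, phase_idx=2))
```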
For the evaluation results of 5-fold cross-validation on the developmental dataset, our AI-Endo model obtains an average accuracy of 91.04% (CI: 89.57%, 92.51%), average precision of 88.48% (CI: 85.98%, 90.97%) and an average recall of 88.77% (CI: 85.99%, 91.54%). The high performance is attributed to the representative features learned from expert surgical videos. For the performance of AI-Endo on each phase, Fig. 2a shows the ROC curves of the four ESD phases, with specific AUROC scores of 97.69% (CI: 94.37%, 100.00%), 98.40% (CI: 96.48%, 100.00%), 97.85% (CI: 96.73%, 98.97%) and 96.69% (CI: 95.94%, 97.44%) for Marking, Injection, Dissection and Idle, respectively. In general, for all four phases, the specificity, sensitivity, and orderliness are all higher than 90% (see detailed results in Fig. 2b). This demonstrates the model's promising performance in accurately predicting ongoing surgical phases from a complex procedure. It is worth noting that the ESD surgical scenes have significant intra-class variance alongside considerable inter-class similarity. Figure 2c demonstrates some successfully recognized frames from each phase under such challenges. For example, in phase Dissection, the trajectory of dissection as well as the dissected surface of the submucosal layer often present variations, making phase recognition difficult. Simultaneously, the tasks of phases Marking and Injection show similarity in the interaction between the surgical tool and the surrounding tissues, such as the insertion on the mucosa layer and the retraction away from the target point.




Fig. 3 | Experimental results on validation dataset with different surgeons and skills. a Phase recognition accuracy of AI-Endo on n = 15 validation ESD cases conducted by different surgeons. Each bar represents one case; b Proportion of phase duration and frequency of phase transition (orange-colored timestep) for surgeons A, B, and C; c Illustration of different dissection tools used in developmental data and external data from different surgeons and skills; d Line-assisted traction tool used in external data of ESD traction technique. Source data are provided as a Source Data file.

Although these situations may lead the AI model to misclassify similar frames from different phases (see the confusion matrix in Fig. 2d), the proposed AI-Endo model still retains remarkable performance in distinguishing them.

Performance of AI-Endo model on validation datasets with different surgeons and skills
The advantages of a learning-based framework are substantially attributed to its ability to recognize surgical actions and learn intrinsic features from surgical video data. For AI-Endo, its modeled spatial embedding and temporal relationships enable it to address various situations. For evaluation on different surgeons, we tested the AI-Endo model on 15 external patients operated on by three endoscopists with different levels of ESD experience. The model yields an average accuracy of 90.93% (CI: 88.52%, 93.33%) for Surgeon A (6 years of experience), 92.93% (CI: 89.81%, 96.04%) for Surgeon B (3 years), and 92.28% (CI: 82.96%, 100.0%) for Surgeon C (2 years). Phase-wise metrics on these 15 cases are provided in Supplementary Table 2. Specific results for each case conducted by these three surgeons are shown in Fig. 3a. These results on different endoscopists demonstrate the generalization capability of the AI-Endo method to accommodate the variation in skill levels of ESD procedures. Such variation affects the proficiency and smoothness of the procedure, which can be reflected by the duration of each surgical phase and the transition frequency between them (see Fig. 3b). In addition, the ESD instruments used in the external validation data are not identical to those in the expert developmental data, because the design and utility of ESD instruments have evolved over time. As examples illustrated in Fig. 3c, the ESD knives in the developmental dataset included the dual knife, insulated-tip (IT), and triangular tip (TT) (Olympus Medical Corporations, Tokyo, Japan), while the external validation dataset also used the updated needle-type knife besides the dual knife and IT. The AI-Endo model can overcome such variation with stable performance regardless of different instruments, showing that its discrimination capability mainly relies on understanding dynamic surgical actions rather than instrument appearances.

For validation on another four cases that involve operation skills unseen in the developmental data, AI-Endo shows an average accuracy of 93.07% (CI: 83.44%, 100.0%) on cases with the pocket creation method. AI-Endo retains the ability to recognize surgical phases in the pocket creation process, even though pocket creation is relatively new and not included in our developmental dataset. This advantage is largely attributed to its potential to capture features of tissue background and tissue-tool interactions, which are shared between conventional operations and pocket creation. The accuracy on ESD with line-assisted traction was lower at 75.22% (CI is not calculated for one case). The limitation in accuracy was caused by the emergence of new functional tools (Fig. 3d) during traction application, which coincides with our expectation, because it is challenging to be applicable to a specialized tool that looks very different from others. Our model predicts phase Idle for the frames involving this tool while predicting correctly for other frames in general.

Ex vivo animal study for validation of AI-Endo model
Existing works on surgical phase recognition have not yet clearly investigated the incorporation of such a model into clinical workflow; therefore, we designed an ex vivo animal study to optimize and validate the proposed framework in our work, ranging from the layout of third-party monitors to the design of the graphical user interface. Compared to conducting an in vivo animal study directly, adopting a preliminary ex vivo study first is more cost-effective to ensure the AI assistance could deliver useful data analytic results and alleviate interruptions caused by add-on AI functionality. To confirm how to seamlessly integrate the AI-Endo computational tool into the endoscopy system, we implemented the whole system in a training laboratory at CUHK Jockey Club Minimally Invasive Surgical Skills Centre. Specifically, after the ex vivo porcine colon was cleaned by water lavage, it was fixed within a plastic tray, then an overtube was attached to the colon to simulate the environment inside the colon.




Fig. 4 panels: a a diagram of the system data flow (animal model → endoscope → endoscopy processor → AI-Endo framework, with outputs to the Olympus screen and the AI summary report); b photographs of the ex vivo and in vivo animal trial settings with components A–E marked; c box plots of the processing time (ms) for IO, ResNet50, TCN, Transformer and the overall system during the animal trial.

Fig. 4 | Experimental settings and real-time performance of pre-clinical animal experiments. a The data flow of the entire system which integrates AI-Endo with the existing clinical Olympus Endoscopic system. Each individual component is correspondingly marked in (b) for both the ex vivo (left) and in vivo (right) experimental settings, with "A" as animal model, "B" as endoscope, "C" as endoscopy processor for video streaming, "D" as existing Olympus screen, "E" as AI-Endo third-party screen which delivers data analytical results. In addition, c shows the processing time for the overall system and the breakdown of each individual key technical part. Results were calculated based on n = 13,341 frames. The box indicates the median as a line in the box and excludes the upper and lower 25% (quartiles) of data, and the whiskers extend from the box by 1.5 × the inter-quartile range (IQR). Source data are provided as a Source Data file.

Figure 4a shows the entire system pipeline and data flow: the surgical operation on the animal model was imaged by an endoscope and streamed by an endoscope processor; the video is imported to the AI-Endo model, and the automatic analytical results are displayed to surgeons in real-time.

We placed a third-party monitor (aside from the existing display screen of the endoscopic view) for surgeons to visualize the AI-predicted surgical phase, which was overlaid on each frame in the top-left corner without occluding the main surgical scene (see Fig. 4b). We measured the computation overhead for I/O data flow, which took 4 ms in total for data flow input (i.e., importing the video stream from the existing surgical system to AI-Endo) and output (i.e., displaying the AI-Endo predicted phase on the screen for surgeons to visualize). The AI-Endo model inference took 17 ms, consisting of 6 ms for the ResNet50 module, 3 ms for the Fusion module, and 8 ms for the Transformer module (cf. details of the AI model architecture in "Methods"). Note that the transformer module uses the most time because it needs to aggregate crucial spatial-temporal information for maintaining recognition accuracy. Overall, the efficiency of the entire AI-Endo recognition system reached 47 fps, which can satisfy the requirement for real-time use without a feeling of visual latency. Two stations were set up using the above-described ex vivo setting, with each serving two novices in the training session. For the four trainees, our AI-Endo yielded an average accuracy of 88.88% (CI: 79.95%, 97.82%) over a total of four cases, showing potential to apply the AI model in a streamed ex vivo setting as a holistic system. Phase-wise metrics indicate high sensitivity and specificity regarding the key phases of Injection and Dissection (Supplementary Table 3).
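The integrated loop described above — stream in a frame, predict the phase, and overlay it without occluding the scene — can be pictured with the following minimal sketch; the capture-device index and the `model.predict` API are hypothetical stand-ins, not the actual AI-Endo software interface.

```python
# Minimal sketch of a real-time phase-overlay loop. The capture device is
# assumed to be a frame grabber fed by the endoscopy processor; `model` is a
# hypothetical object exposing predict(frame) -> int in {0, 1, 2, 3}.
import time
import cv2

PHASES = ["Marking", "Injection", "Dissection", "Idle"]

def run_stream(model, device_index=0):
    cap = cv2.VideoCapture(device_index)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        t0 = time.perf_counter()
        phase_idx = model.predict(frame)                  # assumed API
        latency_ms = (time.perf_counter() - t0) * 1000.0  # per-frame inference time
        cv2.putText(frame, f"{PHASES[phase_idx]} ({latency_ms:.0f} ms)",
                    (20, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow("AI-Endo", frame)                      # third-party monitor
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```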
In vivo animal study on live pigs for validation of AI-Endo model in pre-clinical setting
Numerous works have been proposed for automated surgical phase recognition; however, none of them incorporated in vivo animal trials to demonstrate the clinical application of the system in real-world surgery. Based on the success of the ex vivo animal experiments, we further conducted an ESD surgical training session with in vivo live animal trials, aiming to showcase the clinical applicability of an intelligent phase recognition system with online score analysis and automatic performance report generation. The real-time system integration and data flow of the in vivo experiment were the same as those of the ex vivo experiment.

To support the clinical usage of AI-Endo, we packaged AI-Endo as desktop software that seamlessly operates with prevalent surgical settings. This makes AI-Endo much more accessible to endoscopists, who are more likely to demand a ready-made implementation with a user-friendly graphical interface. In the animal trial, multiple 2-cm-sized lesions were marked for simulated ESD at three different locations in the digestive tract, i.e., rectum, stomach, and esophagus. Twelve ESD procedures were performed on two live pigs: five (1/2/2 for rectum/stomach/esophagus) and seven (2/3/2 for rectum/stomach/esophagus) were conducted by an experienced and a novice endoscopist, respectively. AI-Endo delivered an average accuracy of 83.53% (CI: 81.48%, 85.58%) over all the in vivo procedures. The relative performance degradation was postulated to be due to anatomical differences between pig and human tissue, as well as the experimental setting of fake lesions. Fortunately, for the Dissection phase, which is the most important step of ESD, AI-Endo achieved a specificity of 91.57% (CI: 89.89%, 93.24%) and sensitivity of 86.68% (CI: 83.22%, 90.14%) (Supplementary Table 4). Additionally, AI-Endo achieved accuracy rates of 83.29% (CI: 77.43%, 89.15%), 83.05% (CI: 78.11%, 87.99%) and 84.31% (CI: 78.77%, 89.85%) on rectum, stomach and esophagus, respectively, showing slight differences among different GI organs.

The in vivo animal experiment aimed to serve as a promising pilot study to explore the applicability and capability of AI-Endo for cognitive assistance in real-time complex surgery. In this regard, we tried to derive meaningful skill assessment scores from AI-based workflow recognition results, to automatically analyze the operational skills of novices during the training session. As shown in previous works44, a surgeon with a higher level of surgical skill tends to operate the surgical tools more smoothly, which benefits from a clear plan on the trajectory of surgical tools and the resection surface. To some extent, the smoothness of the operation can be reflected by the frequency of hesitation and the exchange of surgical tools45, which can be quantified by the frequency of the surgeon changing across phases. In an ESD training session, it is useful to monitor trainees' operational skill in real time, which could reflect their learning curve.




Fig. 5 panels: a curves of the Normalized Transition index (×10⁻³) over time (s) for novice and expert cases in the animal trial; b the automatically generated summary report, with header fields (Date: 2022/10/18; Hospital name: Prince of Wales Hospital; Usage: Surgical training; Patient name: Animal; Bed: 1; Note: In-vivo animal trial), analytic results based on surgical phase (transition matrix, current frame, and the change of the Transition index over time, where the Transition index is an online score reflecting the smoothness of surgical operation), and ranks (Overall rank: A out of 4; Idle period / Tumor size: A; Idle period / Dissection period: A; Procedure duration / Tumor size: B; rank notes — A: fluent and smooth operation; B: efficient move with some unnecessary moves; C: frequently stops and unsure of next move).

Fig. 5 | Data analytical results derived from AI-Endo phase recognition for in vivo animal experiments. a The curves of derived online score of the Normalized Transition index calculated for the senior surgeon (2 cases with orange lines) and the novice (2 cases with brown lines) at the esophagus. Inserted photos represent the dissected samples and the scale bar corresponds to 1 cm; b Design of the AI summary report that is automatically generated by the AI-Endo system in pre-clinical trials. Source data are provided as a Source Data file.

In this regard, the AI-Endo system dynamically counted the number of transitions among surgical phases, e.g., the transition from Dissection to Idle when the knife retracts. As the size of the lesion could affect the total duration of the procedure46, we divided the total transition frequency by the length of the tumor to remove its bias on the number of phase transitions. The proposed online surgical score, the Normalized Transition index (NT-index), is defined as the number of transitions divided by the elapsed time and the size of the lesion, which yields an NT-index curve describing the dynamic changes of transition frequency as the operation proceeds. The lower this curve is, the higher the endoscopist's skill level would be. In Fig. 5a, we present the index curves of four in vivo surgical cases, two of which were conducted by the experienced endoscopist and two by the novice. The analytic results show that the NT-index curves of the senior are generally lower than those of the novice. At the end of the procedures, the senior and the novice obtained normalized transition index scores on the rectum, stomach and esophagus of (13.94 vs 21.71), (4.39 vs 10.72) and (10.23 vs 16.85), respectively. The proposed online score NT-index shows a statistical difference (p = 0.048) in the level of ESD skills, which is consistent with our expectations according to the animal trial settings. Based on the index curve, expert endoscopists, e.g., the trainer in surgical training, could provide advice and supervision on specific surgical steps.
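Reading the definition above as the cumulative number of phase transitions divided by elapsed time and lesion size, a minimal sketch of the NT-index computation could look as follows; the exact normalization constants used by AI-Endo are not fully specified in this text, so this is one illustrative interpretation.

```python
# Minimal sketch of an NT-index curve: cumulative phase-transition count,
# normalized by elapsed time and lesion size. The interpretation of the
# normalization is an assumption based on the textual definition.
import numpy as np

def nt_index_curve(phase_per_second, lesion_size_cm):
    phases = np.asarray(phase_per_second)
    transitions = np.concatenate(([0], (phases[1:] != phases[:-1]).astype(int)))
    cumulative = np.cumsum(transitions)
    elapsed = np.arange(1, len(phases) + 1)        # seconds since start
    return cumulative / (elapsed * lesion_size_cm)

# Toy trace oscillating between Dissection (2) and Idle (3): more oscillation
# (hesitation, tool exchange) pushes the curve up.
trace = [3] * 30 + [2] * 120 + [3] * 20 + [2] * 200 + [3] * 15
print(f"final NT-index: {nt_index_curve(trace, lesion_size_cm=2.0)[-1]:.4f}")
```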
In addition, we propose to automatically generate an intelligent report, which summarizes and presents the surgical workflow analytical results to the endoscopists. As shown in Fig. 5b, the summary report intuitively visualizes the duration and ratio of each phase. Different from the manual annotation or repeated derivation in previous works9,36, AI-Endo instantly provides the endoscopist with an overview of the surgical process and also details the factors that might reflect surgical skill, such as the duration of phase periods and their corresponding ratios for each endoscopist. The proposed online score of the Normalized Transition index, together with several straightforward offline scores added in the summary report, is supposed to serve as an essential reference for the investigation of procedural knowledge and decision-making skills, taking a leap forward to the potential clinical applicability of AI-Endo.

Multi-center validation on data from different endoscopic systems and countries
To broaden the application of AI-Endo, it is interesting to observe its generalizability to different endoscopic systems and multiple centers. We assessed the performance of AI-Endo using four cases from Nanfang Hospital, Southern Medical University in Guangzhou, China. These cases were conducted using the Fujifilm endoscopic system, which differs from the Olympus endoscopic system utilized in our developmental dataset. To evaluate the potential of AI-Endo in an international cohort, we further tested AI-Endo on an additional four cases from Internal Medicine III-Gastroenterology, University Hospital of Augsburg, Augsburg, Germany. These cases were recorded with the Olympus system but yielded geographical variations across different countries.

We utilized AI-Endo to process the four cases from Nanfang Hospital, Southern Medical University in Guangzhou, China. All cases were annotated and processed in the same manner as the developmental dataset. AI-Endo finally yielded an average accuracy of 90.75% (CI: 88.50%, 93.01%) and exceptional ROC curves for each phase (Fig. 6a). All phase-wise performance metrics were higher than 88% (Fig. 6d). This investigation shows that AI-Endo's performance is robust and generalizable across different endoscopy systems, which aligns with our expectations regarding ESD surgical settings. During endoscopic procedures, conventional white light images are used, and these images remain largely consistent across different brands of endoscopes. Additionally, the design and implementation of intelligent algorithms in the development of AI-Endo did not depend on assumptions about the type of instrument being used. AI-Endo can accept the video stream and process data in a relatively independent manner, which means the inference speed should not be heavily dependent on the endoscopy system.

Then, four cases from Internal Medicine III-Gastroenterology, University Hospital of Augsburg, Augsburg, Germany were used to showcase the robustness of AI-Endo under geographical variations. Although the cases were conducted at an international center, AI-Endo maintained its high performance and achieved an average accuracy of 87.34% (CI: 84.43%, 90.25%), specificity of 86.01% (CI: 71.48%, 96.27%), and average sensitivity of 86.60% (CI: 74.21%, 96.36%). AI-Endo delivers promising ROC curves on the four phases with AUROC values exceeding 90.67% (Fig. 6b, d). Based on the multi-center datasets from Guangzhou (China) and Augsburg (Germany), we further statistically analyzed the performance of AI-Endo on different organs, including esophagus, colorectum, and stomach, on which AI-Endo keeps a high average accuracy above 86.68% (Fig. 6c). These findings suggest that AI-Endo can robustly handle multi-center cases regardless of differences in their geographical or tumor locations, indicating the potential of AI-Endo for wide applications across international medical centers.




Fig. 6 | Experimental results on multi-center validation datasets from Guangzhou (China) and Augsburg (Germany). a The ROC curve of AI-Endo on cases from Guangzhou, China (n = 4 cases); b The ROC curve of AI-Endo on cases from Augsburg, Germany (n = 4 cases); c Average accuracy on cases of esophagus, colorectum, and stomach in the multi-center datasets; d Phase-wise performance metrics of AI-Endo on multi-center datasets. Data are presented as 95% confidence interval (CI) when applicable. CI is not calculated for the Marking phase in cases from the center in Guangzhou, China, because only one case involves marking. Source data are provided as a Source Data file.

Discussion
This work aims to investigate intelligent surgical phase recognition from bench to bedside. We established a high-quality ESD dataset of expert operations, together with a well-defined annotation protocol for surgical phase recognition. Based on it, we developed the AI-Endo model to recognize surgical phases with representative spatial-temporal features, achieving high performance on both developmental and external validation datasets. This demonstrates that the AI-Endo model trained on expert data is applicable to junior surgeons with different skill levels and to various cases with different ESD techniques and endoscopic systems. More importantly, AI-Endo was seamlessly integrated into pre-clinical settings and validated with ex vivo and in vivo animal trials in real time. The system showed stable performance, and analytical results were delivered to surgeons through a user-friendly interface for intraoperative cognitive assistance and postoperative training assessment.

ESD is a novel endoscopic surgical procedure for complete tumor resection to cure early gastrointestinal (GI) cancer, which is among the most common cancers worldwide. Although ESD has good perioperative outcomes regarding a high rate of en-bloc resection and a low rate of local recurrence, the surgery is still challenging, with a long learning curve for novices. It is clinically desirable to use AI techniques that can learn from expert experience and data to understand surgical contexts and further identify, prevent, and mitigate safety-critical events in operation. To begin with, surgical phase recognition is the fundamental task, i.e., only after the ongoing surgical step is automatically recognized can the smart system conduct subsequent functionalities. Existing works have not systematically investigated this key task due to the lack of expert data, algorithmic limitations, and insufficient pre-clinical validation. This study plays a pioneering role in raising attention and inspiring solutions for AI-assisted ESD.

As observed from the experimental results, our AI-Endo model successfully addressed the dilemma between accuracy and efficiency for surgical workflow prediction in ESD. Using an inference computer equipped with an Intel Xeon(R) 3.7 GHz CPU and one NVIDIA GeForce RTX 3090 GPU, the model is able to yield good online deployment accuracy at 47 fps. Note that this efficiency includes time spent throughout data analytics in the integrated system, rather than the AI model computation alone. Given that the raw data streaming in the existing Olympus system is at maximum 50 fps, based on our human feedback, we did not feel visual latency when using the provided user interface. This demonstrates that the AI model can fulfill real-time requirements given the hardware support of a standard workstation-level configuration. It could suggest the potential of applying advanced surgical AI tools in low-income countries.

Regarding how to properly incorporate the AI-Endo model into the existing clinical workflow, we in fact had multiple rounds of discussion and optimization among engineering and clinical team members. Existing literature on computer-assisted surgery in general has not yet clearly investigated this important issue. Basically, we think that at least two points should be considered for the integrated system design. The first is to ensure that the system delivers useful data analytic results that are otherwise not obtainable without AI assistance. The second is to avoid the add-on AI functionality changing the surgeon's operating habits in the current routine. In these regards, we propose to display the AI predictions on a third-party screen, putting it side-by-side with the existing Olympus screen. The ongoing surgical phase is monitored by AI-Endo behind the curtain, which presents steady progress of the procedure. More importantly, we derive an online score based on the surgical phase recognition for skill assessment and apply it to the ESD training session. This score is automatically calculated to reflect the proficiency and smoothness of ESD. Although it is not yet thoroughly validated in clinical usage, we regard this as an inspiring initial step for driving AI's role in facilitating novice surgeons. In our future work, we target integrating AI-Endo into the endoscopy system as off-the-shelf software, displaying the analytic results on the embedded monitor in a straightforward manner.

Limitations of our work lie in two aspects. First is the relatively small number of cases in the developmental dataset, which is actually a common drawback of most existing works on surgical AI. The current largest public dataset, i.e., Cholec80 on laparoscopic cholecystectomy47, has 80 full-length surgical videos at a high frame rate. Such small-scale training data is still not comparable to the big data used in other deep learning applications such as face recognition and autonomous driving. Fortunately, our collected data was of high quality in terms of expert skill level, long-time expansion, various dissection locations, and diverse surgical scenes, which helped to compensate for the shortage.
Nature Communications | (2023)14:6676 8



Fig. 7 | The architecture of AI-Endo deep learning model for real-time recognition of surgical phases. Each frame of the video stream is sequentially encoded by ResNet50, followed by a temporal convolution network to fuse spatial-temporal information. Thereafter, the spatial embedding at t is used as the query in the transformer-based module for predicting the frame-wise surgical phase. Different colors represent different feature embeddings or output values.

The clearly defined annotation protocol was also important to ensure the labeled 0.2 million training frames were consistent as ground truths for model learning. The second limitation of this work concerns model generalizability, which was noticeable from the performance drop observed in the ex/in vivo animal experiments (Supplementary Tables 3 and 4). Although this can be explained by the appearance difference between animal tissue and human tissue, similar degradation is anticipated under the emergence of new tools (Fig. 3d) that were not covered by the developmental data. A relatively small dataset limits the model's robustness in identifying effective tool features or surgical scenes when an unseen ESD technique is involved. Our currently developed method has not particularly addressed this problem, but it can be extended with domain generalization48 and test-time adaptation49 strategies. Promisingly, the proposed model has shown a noteworthy degree of adaptability to the variations encountered in surgical settings, such as differences in geographical locations and endoscopy systems. This matters for its wider application and multi-center deployment in the future.

Last but not least, future works of this study will continue to focus on AI assistance for ESD. The benefits of automatic phase recognition go beyond the generation of statistical reports and the calculation of the online NT-index, which provide only a limited view of surgical skill evaluation. We encourage community researchers to utilize the open-source code and data we provide to explore the statistical significance of surgical phases and promote progress in surgical training and related areas, such as establishing large-scale structured and segmented surgical phase databases50. Besides, based on the high-performance surgical phase recognition, we will extend the video analysis to semantic segmentation of surgical scenes such as the submucosal layer, muscle layer, and vessels in our future work. We implemented a preliminary segmentation model in this study's in vivo animal experiment, which also fulfilled real-time prediction speed. We will further improve its accuracy and accordingly investigate how to use it to help surgeons reduce adverse events on safety-critical tissues. Moreover, AI-enabled data analytics would provide cognitive assistance and support decision-making for surgery, which has a large potential to enhance surgical safety. As artificial intelligence is increasingly investigated for surgical applications, its way of integration in the operating room and its clinical role in benefiting surgeons are to be emphasized along the way. We aim to include clinical trials in our future work after the entire system is extensively validated with more surgeons and clinical centers, ensuring participant safety in invasive procedures.

Methods
Data collection
In this study, the developmental dataset was collected from Prince of Wales Hospital in Hong Kong, and validation datasets were gathered from Prince of Wales Hospital; Nanfang Hospital, Southern Medical University in Guangzhou, China; and Internal Medicine III-Gastroenterology, University Hospital of Augsburg in Augsburg, Germany. All patient information in these retrospective cohorts was de-identified, and only the imaging system and surgeon's name were kept for data analysis. Ex vivo and in vivo animal cases were conducted during the animal trial sessions at CUHK Jockey Club Minimally Invasive Surgical Skills Centre. Ethical approvals were obtained from the Ethics Committee of The Chinese University of Hong Kong (No. 22-145-MIS).

Problem formulation and network learning
Given an ESD video stream, this work formulates phase recognition as an online classification task based on our previous work20. Given a video stream $V = \{x_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$ with $T$ frames, we build the phase recognition model as a function $F_\theta$ which classifies each frame $x_t$ into one of four surgical phases according to the probability prediction $p_t = F_\theta(x_1, x_2, \ldots, x_t)$, where each element represents the probability of frame $x_t$ belonging to a phase in {Marking, Injection, Dissection, Idle}. Due to the complexity of recognizing surgical phases with large intra-class variation and inter-class similarity, we decompose $F_\theta$ into two stages, $F_\theta = G_\omega \circ H_\phi$, with $G_\omega$ as the feature extractor to encode a discriminative representation for each single frame, and $H_\phi$ as the follow-up spatial-temporal feature aggregator yielding the final phase prediction incorporating video dynamics. An overview of our AI-Endo network is illustrated in Fig. 7.

Methods

Data collection
In this study, the developmental dataset was collected from Prince of Wales Hospital in Hong Kong, and the validation datasets were gathered from Prince of Wales Hospital; Nanfang Hospital, Southern Medical University in Guangzhou, China; and Internal Medicine III-Gastroenterology, University Hospital of Augsburg in Augsburg, Germany. All patient information in these retrospective cohorts was de-identified, and only the imaging system and the surgeon's name were kept for data analysis. Ex vivo and in vivo animal cases were conducted during the animal trial sessions at the CUHK Jockey Club Minimally Invasive Surgical Skills Centre. Ethical approvals were obtained from the Ethics Committee of The Chinese University of Hong Kong (No. 22-145-MIS).

Problem formulation and network learning
Given an ESD video stream, this work formulates phase recognition as an online classification task based on our previous work20. Given a video stream $V = \{x_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$ with $T$ frames, we define the phase recognition model as a function $F_\theta$ which classifies each frame $x_t$ into one of four surgical phases according to the probability prediction $p_t = F_\theta(x_1, x_2, \ldots, x_t)$, where each element represents the probability of frame $x_t$ being a phase in {Marking, Injection, Dissection, Idle}. Due to the complexity of recognizing surgical phases with large intra-class variation and inter-class similarity, we decompose $F_\theta$ into two stages, $F_\theta = G_\omega \circ H_\phi$, with $G_\omega$ as the feature extractor encoding a discriminative representation for each single frame, and $H_\phi$ as the follow-up spatial-temporal feature aggregator yielding the final phase prediction incorporating video dynamics. An overview of our AI-Endo network is illustrated in Fig. 7.
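For orientation, the two-stage decomposition can be sketched in a few lines of PyTorch. This is a minimal illustration under the stated notation only: the class and variable names are ours, the aggregator stands in for any causal spatial-temporal module $H_\phi$, and nothing here reproduces the released implementation.

```python
# Sketch of F_theta = (G_omega, H_phi) for online phase recognition;
# names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

PHASES = ["Marking", "Injection", "Dissection", "Idle"]

class FrameEncoder(nn.Module):
    """G_omega: one video frame -> one spatial embedding."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc head

    def forward(self, x):                      # x: (B, 3, H, W)
        return self.features(x).flatten(1)     # (B, 2048)

class PhaseRecognizer(nn.Module):
    """F_theta: frames x_1..x_t -> per-frame phase probabilities p_t."""
    def __init__(self, encoder, aggregator):
        super().__init__()
        self.encoder, self.aggregator = encoder, aggregator

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        e = self.encoder(frames.flatten(0, 1)).view(b, t, -1)  # spatial embeddings
        logits = self.aggregator(e)            # causal H_phi: (B, T, 4)
        return logits.softmax(dim=-1)          # p_t over the four phases
```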
In ESD surgery, the differences in anatomical structures and lesion locations introduce considerable intra-class variance on $x_t$, imposing challenges for $G_\omega$ to learn discriminative frame-wise representations, which are the basis for spatial-temporal feature learning. We propose to rely on self-supervised learning with a contrastive loss in the training process, formulating $\mathcal{L}_{con}$ (see Eq. 1), which enhances the similarity of embeddings from intra-class frames (a.k.a. positive pairs) while enlarging the distance between inter-class frames (a.k.a. negative pairs). The embedding $e_i$ for each frame $x_i$ is extracted using a pretrained ResNet5051 as the backbone. Meanwhile, in order to enhance the discrimination capability of the learned features toward the phase recognition task, we also add a cross-entropy loss with respect to the phase labels annotated for each frame. In these regards, the overall loss function for training $G_\omega$ is as follows:
$$\omega^{*} = \arg\min_{\omega} \mathcal{L}(G_\omega; \{x_i, y_i\}_{i \in I}) = \mathcal{L}_{con} + \mathcal{L}_{ce},$$
$$\mathcal{L}_{con} = \sum_{i \in I} \frac{-1}{|A(i)|} \sum_{n \in A(i)} \log \frac{\exp(e_i \cdot e_n / \tau)}{\sum_{a \in N(i)} \exp(e_i \cdot e_a / \tau)}, \qquad (1)$$
$$\mathcal{L}_{ce} = \sum_{i \in I} \sum_{k=1}^{4} \mathbb{1}\{y_i = k\}\, \mathrm{NLL}(G_\omega(x_i), y_i),$$

where $i$ denotes the index of frames in the mini-batch $I$. $A(i)$ and $N(i)$ respectively represent the frames that have the same and different phase annotations as $x_i$, and $\tau \in \mathbb{R}^{+}$ denotes the scalar temperature parameter52. The $\mathbb{1}\{y_i = k\}$ is the label indicator for the negative log-likelihood function and equals 1 when $y_i = k$, otherwise 0. In addition, to facilitate real-time deployment, the pretrained feature backbone was pruned by removing the two linear projection heads. The contrastive learning strategy enables the remaining modules to still provide meaningful embeddings without increasing the computation overhead. The final embedding of each frame was sequentially used as the input for the subsequent spatial-temporal feature learning.
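For concreteness, Eq. (1) can be sketched as follows for one mini-batch. This is an illustration under the definitions above, not the released training code: it assumes L2-normalized embeddings and that every mini-batch contains at least two phases (so that $A(i)$ and $N(i)$ are non-empty), and the variable names are ours.

```python
# Sketch of Eq. (1): supervised contrastive term plus cross-entropy term.
import torch
import torch.nn.functional as F

def loss_eq1(e, logits, y, tau=0.07):
    """e: (B, d) frame embeddings, logits: (B, 4), y: (B,) phase labels."""
    e = F.normalize(e, dim=1)
    sim = e @ e.t() / tau                              # e_i . e_n / tau for all pairs
    same = y[:, None] == y[None, :]                    # same-phase mask (incl. self)
    eye = torch.eye(len(y), dtype=torch.bool, device=y.device)
    pos = same & ~eye                                  # A(i): positive pairs
    neg = ~same                                        # N(i): negative pairs
    # denominator of Eq. (1): partition over the negatives only
    neg_logsumexp = torch.logsumexp(sim.masked_fill(~neg, float("-inf")), dim=1)
    log_ratio = sim - neg_logsumexp[:, None]           # log[exp(.)/sum_N exp(.)]
    l_con = (-(log_ratio * pos).sum(1) / pos.sum(1).clamp(min=1)).mean()
    l_ce = F.cross_entropy(logits, y)                  # label indicator + NLL
    return l_con + l_ce
```

In practice, a batch sampler that guarantees both positives and negatives for each anchor avoids the degenerate masks.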
Temporal reasoning is essential for AI-Endo to capture dynamic information in the procedure, such as the trajectory of a surgical tool and its interaction with the targeted tissues. In this regard, we leverage a fusion module to extract long-range temporal information with a temporal convolution network (TCN). In order to aggregate the spatial and temporal information and boost the representation capability, we further incorporate a global attention-based transformer module to capture supportive relationships based on the spatial and temporal embeddings.

For the fusion module, we use the TCN to perform hierarchical refinement on the temporal embedding. Given the spatial embedding sequence $\{e_i = G_\omega(x_i) \in \mathbb{R}^{d}\}_{i=1}^{t}$, the TCN targets generating the temporal embedding by exploring inter-frame relationships. The TCN is composed of multi-level temporal convolution layers, each level of which includes consecutive dilated residual layers. Taking the $l$th layer as an example, the output $D_{l+1}$ is calculated by $D_{l+1} = D_l + W_{2,l} * \{\mathrm{ReLU}(W_{1,l} * D_l + b_{1,l})\} + b_{2,l}$, where $W_{1,l}$ and $W_{2,l}$ are the weights of the dilated convolution and the 1 × 1 convolution, whose biases are denoted as $b_{1,l}$ and $b_{2,l}$, respectively. The first layer accepts $D_0 = \{e_i\}_{i=1}^{t}$ as the initial input. To get a larger receptive field of the temporal convolution, we gradually increase the dilation factor by 2, which yields an increased size of the receptive field. The output $D_l$ is shifted along the temporal dimension so that the output $D_{L+1}$, i.e., the temporal embedding $m_i \in \mathbb{R}^{d_0}$, only relies on current and previous frames.
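One such causal dilated residual layer can be sketched as below; the channel width and the eight-layer stack are illustrative assumptions, chosen so that doubling dilations with kernel size 3 reach a receptive field of roughly 512 current-and-past frames.

```python
# Sketch of the TCN block: D_{l+1} = D_l + W2 * ReLU(W1 * D_l + b1) + b2,
# with left-only padding so no future frame is accessed.
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.left_pad = 2 * dilation               # causal shift along time
        self.conv_dilated = nn.Conv1d(channels, channels, kernel_size=3,
                                      dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, d):                          # d: (B, C, T), i.e., D_l
        x = nn.functional.pad(d, (self.left_pad, 0))
        x = self.conv_1x1(torch.relu(self.conv_dilated(x)))
        return d + x                               # residual output D_{l+1}

# dilations 1, 2, ..., 128: each output frame sees only current and past frames
tcn = nn.Sequential(*[DilatedResidualLayer(64, 2 ** l) for l in range(8)])
```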
Although the fused spatial-temporal embedding at $t$ integrates the temporal information of neighboring frames, a fixed-size embedding representation is insufficient to deliver complicated information in both dimensions of time and space. Therefore, we rely on the transformer module to obtain the phase prediction by further aggregating spatial and temporal information with global attention. Specifically, we take the spatial embedding $e_t$ as the query and the temporal embeddings $M_t$, i.e., the concatenation of $\{m_i\}_{i=t-n+1}^{t}$, as the key and value, where $n$ denotes the range of selected temporal embeddings before time point $t$. The spatial embedding $e_t$ is first reduced to $\hat{e}_t$, with the same dimension as that of $m_i$, through linear projection. Then $\hat{e}_t$ and $M_t$ are processed by the transformer as:

$$\mathrm{Trans}(\hat{e}_t, M_t) = \mathrm{softmax}\left(\frac{W_q \hat{e}_t \times (W_k M_t)^{T}}{\sqrt{d_0}}\right) W_v M_t, \qquad (2)$$

where the $W$ are the linear projection mapping matrices and $p_t = \mathrm{softmax}(\mathrm{Trans}(\hat{e}_t, M_t))$ yields the final phase prediction. The fusion module and transformer module can be trained end-to-end to derive the optimized model $H_\phi^{*}$. The trained model is capable of extracting long-range spatial-temporal information.
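The readout of Eq. (2) amounts to a single-query cross-attention; a sketch is given below. The embedding sizes and the final linear head that maps the fused vector to the four phase logits are our assumptions for illustration, since Eq. (2) itself only specifies the attention step.

```python
# Sketch of Eq. (2): the spatial query \hat{e}_t attends over temporal
# embeddings M_t; dimensions and the classification head are assumptions.
import torch
import torch.nn as nn

class TransReadout(nn.Module):
    def __init__(self, d_spatial=2048, d0=64, num_phases=4):
        super().__init__()
        self.reduce = nn.Linear(d_spatial, d0)       # e_t -> \hat{e}_t
        self.wq = nn.Linear(d0, d0, bias=False)
        self.wk = nn.Linear(d0, d0, bias=False)
        self.wv = nn.Linear(d0, d0, bias=False)
        self.head = nn.Linear(d0, num_phases)        # assumed phase-logit head
        self.scale = d0 ** 0.5

    def forward(self, e_t, m_t):                     # e_t: (B, 2048), m_t: (B, n, d0)
        q = self.wq(self.reduce(e_t)).unsqueeze(1)   # (B, 1, d0)
        k, v = self.wk(m_t), self.wv(m_t)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        fused = (attn @ v).squeeze(1)                # softmax(QK^T / sqrt(d0)) V
        return torch.softmax(self.head(fused), dim=-1)   # p_t
```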

High-throughput online prediction
The application of this framework requires efficient deployment coupled with intraoperative video streaming. To achieve this goal, we reduce the computation complexity by analyzing how the feature embeddings at each frame are updated according to the receptive field of the AI model. For the fusion module, rather than continuously storing all the spatial embeddings $\{e_i\}_{i=1}^{t}$ for the temporal reasoning of the TCN, we selectively keep only the embeddings within its receptive field. Concretely, given that the receptive field of the TCN is 512, the spatial embedding $e_{t+1}$ only interacts with the 511 previous frames, i.e., accounting for over 10 s at our inference speed of 47 fps. We build a first-in-first-out (FIFO) queue to dynamically store $\{e_i\}_{i=t-510}^{t+1}$. When a spatial embedding $e_i$ falls out of the receptive field, it graduates from the queue. Notably, this framework keeps a high inference efficiency and also fully preserves the accuracy.
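A minimal sketch of this bounded cache is shown below, assuming the 512-frame receptive field stated above; Python's deque with maxlen evicts the oldest embedding automatically, which is exactly the "graduation" behavior described.

```python
# Sketch of the FIFO embedding cache for online prediction.
from collections import deque

import torch

RECEPTIVE_FIELD = 512
queue = deque(maxlen=RECEPTIVE_FIELD)          # holds e_{t-510} ... e_{t+1}

def on_new_frame(e_next):
    """Append the newest spatial embedding and return the TCN input window."""
    queue.append(e_next)                        # oldest embedding graduates if full
    return torch.stack(tuple(queue), dim=0)     # (<=512, d)
```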


For the model training, all cases of the developmental dataset were first arranged in chronological order, and then we sampled them at five equal intervals. This procedure resulted in 5 folds, with four folds used for training and the remaining one for testing in a cross-validation manner. The framework was optimized in two separate stages, i.e., training the feature embedding $G_\omega$ and the spatial-temporal information aggregation $H_\phi$. At the first stage, $G_\omega$ was trained for 8000 iterations with a batch size of 128. The learning rate started from 5e−4 and was reduced by a factor of 10 after 6000 iterations. When the first stage of training was finished, we fixed and utilized the trained model $G_\omega^{*}$ to generate the spatial embeddings of all frames for training $H_\phi$. At the second stage, the model $H_\phi$ was trained for 4000 iterations by selecting all consecutive frames of a video as the input in each iteration. By adopting temporal convolution53, the model was empowered to process all feature embeddings of the video in a causal manner, thus preserving the characteristics necessary for online prediction. We set the learning rate as 5e−3 at the beginning and multiplied it by 0.1 at iterations 1500 and 2500. The parameters of the modules $G_\omega$ and $H_\phi$ were both optimized by SGD with momentum. Supplementary Fig. 5 shows the curves of the training loss, which became flat at the end of the training process. Therefore, the models at the final iteration were used for phase prediction. After optimizing and fixing the network structure and hyper-parameters, we proceeded to retrain the model using the entire developmental dataset. This was done to maximize the amount of available training data and improve the model's generalization performance54. For any future applications of AI-Endo on other datasets, such as the external, ex vivo, and in vivo animal studies, we used the model that was trained on the entire developmental dataset.
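The two-stage schedule above can be expressed as follows; this is a sketch with stand-in modules, and the momentum value is an assumed default rather than a reported hyper-parameter.

```python
# Sketch of the two-stage optimization schedule (SGD with momentum,
# step decays at the stated iterations); modules are placeholders.
import torch
import torch.nn as nn

encoder = nn.Linear(2048, 2048)    # stand-in for G_omega (see earlier sketch)
aggregator = nn.Linear(2048, 4)    # stand-in for H_phi

# Stage 1: train G_omega for 8000 iterations (batch size 128),
# lr 5e-4 reduced by a factor of 10 after 6000 iterations.
opt_g = torch.optim.SGD(encoder.parameters(), lr=5e-4, momentum=0.9)
sch_g = torch.optim.lr_scheduler.MultiStepLR(opt_g, milestones=[6000], gamma=0.1)

# Stage 2: freeze G_omega*, train H_phi for 4000 iterations (one video each),
# lr 5e-3 multiplied by 0.1 at iterations 1500 and 2500.
for p in encoder.parameters():
    p.requires_grad = False
opt_h = torch.optim.SGD(aggregator.parameters(), lr=5e-3, momentum=0.9)
sch_h = torch.optim.lr_scheduler.MultiStepLR(opt_h, milestones=[1500, 2500],
                                             gamma=0.1)
```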

Fig. 8 | The desktop software of AI-Endo. The user interface includes basic information, phase prediction, the AI result display, and a summary report generation button. The software is integrated into real-time clinical settings.

Description of the animal studies
Study design of live animal experiments. Two healthy female pigs of ~30 kg were used as the in vivo porcine models for ESD experiments under general anesthesia at the CUHK Jockey Club Minimally Invasive Surgical Skills Centre (CUHK MISSC). The procedures were performed with a high-definition endoscope (GIF-H190 with a straight transparent hood, Olympus Medical Corporation, Tokyo, Japan) and an ESD knife (DualKnife J, Olympus Medical Corporation, Tokyo, Japan). The VIO3 (Erbe Elektromedizin GmbH, Germany) was used as the electrosurgical power platform. During the ESD procedures, circular lesions of 2 cm in diameter were pre-marked in the porcine esophagus, stomach, and rectum for subsequent ESD simulation in the animal experiments. Due to the increasing difficulty of performing ESD in the stomach, esophagus, and rectum, the time required for each procedure also increased accordingly, especially for the novice endoscopist. As a result, the number of procedures performed by each endoscopist differed. Specifically, the experienced endoscopist performed seven procedures, including 3 stomach, 2 esophagus, and 2 rectum, while the novice endoscopist performed five procedures, 2 stomach, 2 esophagus, and 1 rectum. Such a design of the animal experiments covers diverse scenarios, therefore allowing us to observe the AI model's efficacy in general. Consent was obtained from the endoscopists to publish identifiable information as shown in Fig. 4b.

To integrate AI-Endo into the real-time surgical workflow, we packaged the algorithms as ready-to-use software providing automatic data analytics and an interactive user interface (illustrated in Fig. 8). It was deployed on a standardized workstation with an Intel Xeon(R) 3.7 GHz CPU and one NVIDIA GeForce RTX 3090 GPU. The video stream from the Olympus endoscopy system was exported from the SDI port, converted through an SDI-to-USB converter (U3SDH, ACASIS, China), and then imported into the AI-Endo software. This data was used as input to the network for intelligent workflow recognition, and the predicted outputs were visualized through the user interface. Our AI-Endo adopted a third-party monitor, thus not changing the operation style of an existing clinical setting. The user interface was mainly composed of three parts, including basic procedure information, the phase recognition result, and the intelligent skill summary. Specifically, using the AI-Endo software in practice, we can record basic clinical information, e.g., patient ID, surgeon name, lesion location and size, operation date, etc. As the procedure starts, AI-Endo automatically recognizes the ongoing surgical phase at each time point and overlays the results onto the video frames dynamically. The computation speed can achieve 47 fps using the standard workstation, which sufficiently satisfies the requirements for real-time application. In live experiments, the AI software could monitor the progress of the ESD procedure and timely reflect the smoothness of dissection, which was useful for the mentor to easily track the practicing status of trainees. We noticed that AI-Endo was sensitive to the actions of the surgical instruments and their interaction with the target tissues, which was reflected in online predictions showing frequent transitions between dissection and idle. Upon finishing the entire procedure, our AI-Endo software can automatically generate a structural report to statistically summarize the surgical workflow and give an objective assessment of the training session. All the developed functions are easy for surgeons to use without the need for coding experience. Both the mentor and the trainee were satisfied with the AI-Endo software design and its way of incorporation into the existing operation system.
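The intraoperative loop can be sketched as below, assuming the SDI-to-USB converter enumerates as an ordinary UVC camera readable with OpenCV; the device index, window name, and ModelStub wrapper are hypothetical and only mark where the online predictor of the previous section would be invoked.

```python
# Sketch of the real-time capture-and-overlay loop.
import cv2

class ModelStub:
    """Hypothetical placeholder for the online phase-recognition wrapper."""
    def predict(self, frame):
        return "Idle", 0.99                    # phase name and probability

model = ModelStub()
cap = cv2.VideoCapture(0)                      # SDI-to-USB converter as UVC camera
while cap.isOpened():
    ok, frame = cap.read()                     # one BGR frame of the video stream
    if not ok:
        break
    phase, prob = model.predict(frame)
    cv2.putText(frame, f"{phase} ({prob:.2f})", (20, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("AI-Endo", frame)               # shown on the third-party monitor
    if cv2.waitKey(1) == 27:                   # Esc key ends the session
        break
cap.release()
cv2.destroyAllWindows()
```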
The AI-Endo could generate an intelligent report that presents statistical information and a structural summary of the training performance (Fig. 5b). Taking advantage of the automatic data analysis, the performance of the surgeon could be assessed immediately in an objective way. The specific content of the report includes basic information, phase statistics, and skill assessment. First, the basic information shows the date, hospital, case name, endoscopist name, training session, and settings. Second, the phase statistics section visualizes the entire procedure using a color bar, where the frames of each phase are marked in different colors over time, based on the automatic recognition results. Meanwhile, the training status is shown alongside, indicating the degree of guidance that the trainee received from the mentor, i.e., independent, with help, or take over. As a quantitative analysis, we calculated the duration distribution across the four phases, which is useful for an overall understanding of the surgical skills for ESD9,36. We obtained the percentage of each phase by counting the number of corresponding frames from the AI predictions, with the calculated ratios visualized in an intuitive pie chart. Third, our AI-Endo software further analyzed the skill-aspect performance of the endoscopists based on the phase recognition results. The online score of the NT-index and the offline scores of the transition metrics and phase periods (which can be straightforwardly derived from the above statistics) were reported. The curve of the NT-index was plotted along time in an overview format. The transition matrix was added to show the inter-changing frequency among phases, which reflects the proficiency and smoothness of the operator. The phase periods list the time duration of each phase, which was further used for calculating the Idle period/Tumor size, Idle period/Dissection period, and Procedure duration/Tumor size. Based on these skill assessments, we could compare and rank the performance of the endoscopists for the training session. Our experiment demonstrated that this relative comparison matched the mentor's subjective impression of the skill levels of the different trainees. The design of the report template was a result of insights and rounds of discussions from both engineers and clinicians.
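The phase statistics of the report can be derived directly from the per-frame predictions, along the lines of the sketch below; it covers only the duration, percentage, and transition-matrix computations (the NT-index and the report layout of the released software are not reproduced here).

```python
# Sketch of the report's phase statistics from per-frame predictions.
from collections import Counter

PHASES = ["Marking", "Injection", "Dissection", "Idle"]

def phase_statistics(pred, fps=47):
    """pred: list of per-frame phase names in temporal order."""
    counts = Counter(pred)
    durations = {p: counts[p] / fps for p in PHASES}                 # seconds
    percentages = {p: 100 * counts[p] / len(pred) for p in PHASES}   # pie chart
    transitions = {p: {q: 0 for q in PHASES} for p in PHASES}        # matrix
    for cur, nxt in zip(pred, pred[1:]):
        if cur != nxt:                        # count inter-changing events only
            transitions[cur][nxt] += 1
    return durations, percentages, transitions
```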

Statistical analysis
All statistical analyses were performed with Python (v3.6). For the quantitative results of the performance on the development and external datasets, we adopted Student's t-distribution with a 95% confidence interval (CI: lower%, upper%). To compare the analytical results from different groups, we used a two-sided pairwise t-test to inspect their statistical difference. A P-value of <0.05 was considered statistically significant.
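Both analyses can be reproduced along these lines with SciPy, assuming the pairwise comparison is over matched per-case values; the arrays are placeholders for the per-case metrics.

```python
# Sketch of the reported statistics with SciPy.
import numpy as np
from scipy import stats

def t_confidence_interval(x, level=0.95):
    """CI of the mean under Student's t-distribution (default 95%)."""
    x = np.asarray(x, dtype=float)
    lo, hi = stats.t.interval(level, df=len(x) - 1,
                              loc=x.mean(), scale=stats.sem(x))
    return lo, hi

def compare_groups(a, b, alpha=0.05):
    """Two-sided t-test on matched per-case values; significant if P < 0.05."""
    t_stat, p_value = stats.ttest_rel(a, b)
    return p_value, p_value < alpha
```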

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
All data supporting the in vivo and ex vivo animal trial studies are publicly available in the Figshare database https://doi.org/10.6084/m9.figshare.23506866.v5. Due to ethical regulations on confidentiality and privacy, access to the human cases used for training and validating the models is limited to authorized researchers approved by the ethics committee. The timeframe for the ethics application would be about two months. These data are available from the corresponding authors upon request with justification of the specific usage of the data and non-commercial purposes. Source data are provided with this paper.

Code availability
AI-Endo was implemented with Python 3.6.13 and PyTorch 1.10.2. The source code for this project is available at the GitHub repository https://github.com/med-air/AI-Endo55.
Hum. Mach. Syst. 44, 650–663 (2014).
References 22. Le, V.-T., Tran-Trung, K. & Hoang, V. T. A comprehensive review of
1. Maier-Hein, L. et al. Surgical data science—from concepts toward recent deep learning techniques for human activity recognition.
clinical translation. Med. Image Anal. 76, 102306 (2022). Comput. Intell. Neurosci. 2022, 8323962 (2022).
2. Lalys, F. & Jannin, P. Surgical process modelling: a review. Int. J. 23. Ji, S., Xu, W., Yang, M. & Yu, K. 3D convolutional neural networks for
Comput. Assist. Radiol. Surg. 9, 495–511 (2014). human action recognition. IEEE Trans. Patt. Anal. Mach. Intell. 35,
3. Katić, D. et al. LapOntoSPM: an ontology for laparoscopic surgeries 221–231 (2012).
and its application to surgical phase recognition. Int. J. Comput. 24. Meli, D., Fiorini, P. & Sridharan, M. Towards inductive learning of
Assist. Radiol. Surg. 10, 1427–1434 (2015). surgical task knowledge: a preliminary case study of the peg
4. Zhang, J. et al. Symmetric dilated convolution for surgical gesture transfer task. Procedia Comput. Sci. 176, 440–449 (2020).
recognition. In Proc. 23rd International Conference Medical Image 25. Bar, O. et al. Impact of data on generalization of AI for surgical
Computing and Computer Assisted Intervention (MICCAI 2020), intelligence applications. Sci. Rep. 10, 1–12 (2020).
409–418 (Springer, 2020). 26. Vedula, S. S. & Hager, G. D. Surgical data science: the new knowl-
5. Lau, K. C., Yam, Y. & Chiu, P. W. Y. An advanced endoscopic surgery edge domain. Innov. Surg. Sci. 2, 109–121 (2017).
robotic platform for removal of early-stage gastrointestinal cancer 27. Maier-Hein, L. et al. Surgical data science for next-generation
using endoscopic submucosal dissection. HKIE Trans. 28, 186–198 interventions. Nat. Biomed. Eng. 1, 691–696 (2017).
(2021). 28. Hashimoto, D. A., Rosman, G., Rus, D. & Meireles, O. R. Artificial
6. Hamilton, J. M. et al. Toward effective pediatric minimally invasive intelligence in surgery: promises and perils. Ann. Surg. 268,
surgical simulation. J. Pediatr. Surg. 46, 138–144 (2011). 70–76 (2018).
7. Takazawa, S. et al. Video-based skill assessment of endoscopic 29. Chiu, P. W.-y, Zhou, S. & Dong, Z. A look into the future of endo-
suturing in a pediatric chest model and a box trainer. J. Lapar- scopic submucosal dissection and third space endoscopy: the role
oendosc. Adv. Surg. Tech. 25, 445–453 (2015). for robotics and other innovation. Gastrointest. Endosc. Clin. 33,
8. Wälter, A. et al. Video-based assessment of practical operative skills 197–212 (2023).
for undergraduate dental students. Trends Comput. Sci. Inf. Tech- 30. Guzmán-García, C., Sánchez-González, P., Oropesa, I. & Gómez, E.
nol. 3, 005–014 (2018). J. Automatic assessment of procedural skills based on the surgical
9. Takeuchi, M. et al. Automated surgical-phase recognition for robot- workflow analysis derived from speech and video. Bioengineering
assisted minimally invasive esophagectomy using artificial intelli- 9, 753 (2022).
gence. Ann. Surg. Oncol. 29, 6847–6855 (2022). 31. Liu, D. et al. Towards unified surgical skill assessment. In Proc. IEEE/
10. Shen, D., Wu, G. & Suk, H.-I. Deep learning in medical image ana- CVF Conference on Computer Vision and Pattern Recognition,
lysis. Annu. Rev. Biomed. Eng. 19, 221 (2017). 9522–9531 (2021).
11. Carin, L. & Pencina, M. J. On deep learning for medical image 32. Gao, X., Jin, Y., Zhao, Z., Dou, Q. & Heng, P.-A. Future frame
analysis. JAMA 320, 1192–1193 (2018). prediction for robot-assisted surgery. In Proc. International

Nature Communications | (2023)14:6676 12


Article https://doi.org/10.1038/s41467-023-42451-8

33. Hotta, K. et al. Learning curve for endoscopic submucosal dissection of large colorectal tumors. Dig. Endosc. 22, 302–306 (2010).
34. Oda, I., Odagaki, T., Suzuki, H., Nonaka, S. & Yoshinaga, S. Learning curve for endoscopic submucosal dissection of early gastric cancer based on trainee experience. Dig. Endosc. 24, 129–132 (2012).
35. Tsou, Y.-K. et al. Learning curve for endoscopic submucosal dissection of esophageal neoplasms. Dis. Esophagus 29, 544–550 (2016).
36. Cetinsaya, B. et al. A task and performance analysis of endoscopic submucosal dissection (ESD) surgery. Surg. Endosc. 33, 592–606 (2019).
37. Dou, Q. et al. Automatic detection of cerebral microbleeds from MR images via 3D convolutional neural networks. IEEE Trans. Med. Imaging 35, 1182–1195 (2016).
38. de Tejada, A. H. ESD training: a challenging path to excellence. World J. Gastrointest. Endosc. 6, 112 (2014).
39. Takezawa, T. et al. The pocket-creation method facilitates colonic endoscopic submucosal dissection (with video). Gastrointest. Endosc. 89, 1045–1053 (2019).
40. Yoshida, M. et al. Conventional versus traction-assisted endoscopic submucosal dissection for gastric neoplasms: a multicenter, randomized controlled trial (with video). Gastrointest. Endosc. 87, 1231–1240 (2018).
41. Aspart, F. et al. ClipAssistNet: bringing real-time safety feedback to operating rooms. Int. J. Comput. Assist. Radiol. Surg. 17, 5–13 (2022).
42. Fluss, R., Faraggi, D. & Reiser, B. Estimation of the Youden index and its associated cutoff point. Biom. J. 47, 458–472 (2005).
43. Ruopp, M. D., Perkins, N. J., Whitcomb, B. W. & Schisterman, E. F. Youden index and optimal cut-point estimated from observations affected by a lower limit of detection. Biom. J. 50, 419–430 (2008).
44. Martin, J. et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br. J. Surg. 84, 273–278 (1997).
45. Doyle, J. D., Webber, E. M. & Sidhu, R. S. A universal global rating scale for the evaluation of technical skills in the operating room. Am. J. Surg. 193, 551–555 (2007).
46. Ahn, J. Y. et al. Procedure time of endoscopic submucosal dissection according to the size and location of early gastric cancers: analysis of 916 dissections performed by 4 experts. Gastrointest. Endosc. 73, 911–916 (2011).
47. Twinanda, A. P. et al. EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Med. Imaging 36, 86–97 (2016).
48. Dou, Q., Coelho de Castro, D., Kamnitsas, K. & Glocker, B. Domain generalization via model-agnostic learning of semantic features. In Proc. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) (2019).
49. Yang, H. et al. DLTTA: dynamic learning rate for test-time adaptation on cross-domain medical images. IEEE Trans. Med. Imaging 41, 3575–3586 (2022).
50. Mascagni, P. et al. Computer vision in surgery: from potential to clinical value. NPJ Digit. Med. 5, 163 (2022).
51. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
52. Khosla, P. et al. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 33, 18661–18673 (2020).
53. Oord, A. v. d. et al. WaveNet: a generative model for raw audio, 125–125 (International Speech Communication Association, 2016).
54. Ebrahimi Kahou, S., Michalski, V., Konda, K., Memisevic, R. & Pal, C. Recurrent neural networks for emotion recognition in video. In Proc. 2015 ACM on International Conference on Multimodal Interaction, 467–474 (2015).
55. Cao, J. et al. Intelligent surgical workflow recognition for endoscopic submucosal dissection with real-time animal study. GitHub https://github.com/med-air/AI-Endo (2023).

Acknowledgements
We thank Dr. Alanna Ebigbo, MD, Dr. Andreas Probst, MD, and Dr. Helmut Messmann, MD, for help in collecting external validation data from Internal Medicine III-Gastroenterology, University Hospital of Augsburg, Augsburg, Germany. We also thank Dr. Side Liu, MD, for helping collect external validation data from Nanfang Hospital, Southern Medical University in Guangzhou, China. This research was partially supported by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd and Multi-scale Medical Robotics Center (MRC) Ltd under the HKSARG's Innovation and Technology Commission (ITC)'s InnoHK Scheme, Hong Kong Innovation and Technology Commission Project No. ITS/237/21FP, Hong Kong Research Grants Council Project No. T45-401/22-N, and Guangdong Science and Technology Plan Project No. 2022A1515011477. The funders had no role in the study's conduct, design, data collection, interpretation, or the writing of the report.

Author contributions
Q.D., M.L.M., W.Y.C., Y.Y. and H.C.Y. conceived and designed the study; J.F.C., H.C.Y., Y.Y.C., M.S., X.B.L., M.K.C., and W.Y.C. were responsible for collecting and organizing data as well as conducting manual annotations; J.F.C., Y.Y.C., H.Z.Y., Y.H.L., Y.M.J., Y.Y., M.L.M., and Q.D. conducted artificial intelligence framework formulation, deep learning algorithm implementations, experimental design and data analysis; H.C.Y., M.S., and W.Y.C. provided feedback on the developed models from a clinical perspective and conducted animal trials; J.F.C., H.C.Y., Y.M.J., M.L.M., W.Y.C., Y.Y., and Q.D. co-wrote the manuscript with all the other authors providing constructive feedback for revising the manuscript. All authors read and approved the final manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41467-023-42451-8.

Correspondence and requests for materials should be addressed to Hon-Chi Yip, Philip Wai-Yan Chiu, Yeung Yam, Helen Mei-Ling Meng or Qi Dou.

Peer review information Nature Communications thanks Mitsuhiro Fujishiro and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2023
