
Human Reproduction, Vol.35, No.4, pp. 770–784, 2020
Advance Access Publication on April 2, 2020 doi:10.1093/humrep/deaa013

ORIGINAL ARTICLE Embryology

Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy during IVF

Downloaded from https://academic.oup.com/humrep/article/35/4/770/5815143 by guest on 27 October 2023
M. VerMilyea 1,2,†, J.M.M. Hall 3,4,†, S.M. Diakiw 3, A. Johnston 3,5, T. Nguyen 3, D. Perugini 3, A. Miller 1, A. Picou 1, A.P. Murphy 3, and M. Perugini 3,6,*

1 Laboratory Operations, Ovation Fertility, Austin, TX 78731, USA
2 IVF Laboratory, Texas Fertility Center, Austin, TX 78731, USA
3 Life Whisperer Diagnostics, Presagen Pty Ltd., Adelaide, SA 5000, Australia
4 Australian Research Council Centre of Excellence for Nanoscale BioPhotonics, The University of Adelaide, Adelaide, SA 5000, Australia
5 Australian Institute for Machine Learning, School of Computer Science, The University of Adelaide, Adelaide, SA 5000, Australia
6 Adelaide Medical School, Faculty of Health Sciences, The University of Adelaide, Adelaide, SA 5000, Australia

*Correspondence address. [email protected]

Submitted on October 13, 2019; resubmitted on December 23, 2019; editorial decision on January 16, 2020

STUDY QUESTION: Can an artificial intelligence (AI)-based model predict human embryo viability using images captured by optical light
microscopy?
SUMMARY ANSWER: We have combined computer vision image processing methods and deep learning techniques to create the non-
invasive Life Whisperer AI model for robust prediction of embryo viability, as measured by clinical pregnancy outcome, using single static images
of Day 5 blastocysts obtained from standard optical light microscope systems.
WHAT IS KNOWN ALREADY: Embryo selection following IVF is a critical factor in determining the success of ensuing pregnancy.
Traditional morphokinetic grading by trained embryologists can be subjective and variable, and other complementary techniques, such as
time-lapse imaging, require costly equipment and have not reliably demonstrated predictive ability for the endpoint of clinical pregnancy. AI
methods are being investigated as a promising means for improving embryo selection and predicting implantation and pregnancy outcomes.
STUDY DESIGN, SIZE, DURATION: These studies involved analysis of retrospectively collected data including standard optical light
microscope images and clinical outcomes of 8886 embryos from 11 different IVF clinics, across three different countries, between 2011 and
2018.
PARTICIPANTS/MATERIALS, SETTING, METHODS: The AI-based model was trained using static two-dimensional optical light
microscope images with known clinical pregnancy outcome as measured by fetal heartbeat to provide a confidence score for prediction of
pregnancy. Predictive accuracy was determined by evaluating sensitivity, specificity and overall weighted accuracy, and was visualized using
histograms of the distributions of predictions. Comparison to embryologists’ predictive accuracy was performed using a binary classification
approach and a 5-band ranking comparison.
MAIN RESULTS AND THE ROLE OF CHANCE: The Life Whisperer AI model showed a sensitivity of 70.1% for viable embryos while
maintaining a specificity of 60.5% for non-viable embryos across three independent blind test sets from different clinics. The weighted overall
accuracy in each blind test set was >63%, with a combined accuracy of 64.3% across both viable and non-viable embryos, demonstrating model
robustness and generalizability beyond the result expected from chance. Distributions of predictions showed clear separation of correctly and
incorrectly classified embryos. Binary comparison of viable/non-viable embryo classification demonstrated an improvement of 24.7% over
embryologists’ accuracy (P = 0.047, n = 2, Student’s t test), and 5-band ranking comparison demonstrated an improvement of 42.0% over
embryologists (P = 0.028, n = 2, Student’s t test).

† The authors consider that the first two authors should be regarded as joint first authors.
© The Author(s) 2020. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Artificial intelligence for embryo viability assessment 771

LIMITATIONS, REASONS FOR CAUTION: The AI model developed here is limited to analysis of Day 5 embryos; therefore, further
evaluation or modification of the model is needed to incorporate information from different time points. The endpoint described is clinical
pregnancy as measured by fetal heartbeat, and this does not indicate the probability of live birth. The current investigation was performed with
retrospectively collected data, and hence it will be of importance to collect data prospectively to assess real-world use of the AI model.
WIDER IMPLICATIONS OF THE FINDINGS: These studies demonstrated an improved predictive ability for evaluation of embryo
viability when compared with embryologists’ traditional morphokinetic grading methods. The superior accuracy of the Life Whisperer AI
model could lead to improved pregnancy success rates in IVF when used in a clinical setting. It could also potentially assist in standardization of
embryo selection methods across multiple clinical environments, while eliminating the need for complex time-lapse imaging equipment. Finally,
the cloud-based software application used to apply the Life Whisperer AI model in clinical practice makes it broadly applicable and globally
scalable to IVF clinics worldwide.
STUDY FUNDING/COMPETING INTEREST(S): Life Whisperer Diagnostics, Pty Ltd is a wholly owned subsidiary of the parent
company, Presagen Pty Ltd. Funding for the study was provided by Presagen with grant funding received from the South Australian Government:

Research, Commercialisation and Startup Fund (RCSF). ‘In kind’ support and embryology expertise to guide algorithm development were
provided by Ovation Fertility. J.M.M.H., D.P. and M.P. are co-owners of Life Whisperer and Presagen. Presagen has filed a provisional patent
for the technology described in this manuscript (52985P pending). A.P.M. owns stock in Life Whisperer, and S.M.D., A.J., T.N. and A.P.M. are
employees of Life Whisperer.

Key words: assisted reproduction / embryo quality / IVF/ICSI outcome / artificial intelligence / machine learning

Introduction

With global fertility generally declining (GBD, 2018), many couples and individuals are turning to assisted reproduction procedures for help with conception. Unfortunately, success rates for IVF are quite low at ∼20–30% (Wang and Sauer, 2006), placing significant emotional and financial strain on those seeking to achieve a pregnancy. During the IVF process, one of the critical determinants of a successful pregnancy is embryo quality, and the embryo selection process is essential for ensuring the shortest time to pregnancy for the patient. There is a pressing motivation to improve the way in which embryos are selected for transfer into the uterus during the IVF process.

Currently, embryo selection is a manual process involving assessment of embryos by trained clinical embryologists, through visual inspection of morphological features using an optical light microscope. The most common scoring system used by embryologists is the Gardner scale (Gardner and Sakkas, 2003), in which morphological features such as inner cell mass (ICM) quality, trophectoderm quality and embryo developmental advancement are evaluated and graded according to an alphanumeric scale. One of the key challenges in embryo grading is the high level of subjectivity and intra- and inter-operator variability that exists between embryologists of different skill levels (Storr et al., 2017). This means that standardization is difficult even within a single laboratory, and impossible across the industry as a whole. Other complementary techniques are available for assisting with embryo selection, such as time-lapse imaging, which continuously monitors the growth of embryos in culture with simple algorithms that assess critical growth milestones. Although this approach is useful in determining whether an embryo at an early stage will develop through to a mature blastocyst, it has not been demonstrated to reliably predict pregnancy outcomes, and it is therefore limited in its utility for embryo selection at the Day 5 time point (Chen et al., 2017). Additionally, the requirement for specialized time-lapse imaging hardware makes this approach cost-prohibitive for many laboratories and clinics, and limits widespread use of the technique.

The objective of the current clinical investigation was to develop and test a non-invasive artificial intelligence (AI)-based assessment approach to aid in embryo selection during IVF, using single static two-dimensional images captured by optical light microscopy. The aim was to combine computer vision image processing methods and deep learning to create a robust model for analysis of Day 5 embryos (blastocysts) for prediction of clinical pregnancy outcomes. This is the first report of an AI-based embryo selection method that can be used for analysis of images taken using standard optical light microscope systems, without requiring time-lapse imaging equipment for operation, and which is predictive of pregnancy outcome. Using an AI screening method to improve selection of embryos prior to transfer has the potential to improve IVF success rates in a clinical setting.

A common challenge in evaluating AI and machine learning methods in the medical industry is that each clinical domain is unique and requires a specialized approach to address the issue at hand. There is a tendency for industry to compare the accuracy of AI in one clinical domain to another, or to compare the accuracy of different AI approaches within a domain that assess different endpoints. These are not valid comparisons, as they consider neither the clinical context nor the relevance of the ground-truth endpoint used for the assessment of the AI. Care must be taken to understand the context in which the AI is operating and the benefit it provides in complement with current clinical processes. One example presented by Sahlsten et al. (2019) described an AI model that assessed fundus images for diabetic retinopathy with an accuracy of over 90%. In this domain, the clinician baseline accuracy is ∼90%, and therefore an AI accuracy of >90% is reasonable and necessary to justify clinical relevance. Similarly, in the field of embryology, AI models developed by Khosravi et al. (2019) and Kragh et al. (2019) showed high levels of accuracy in classification of blastocyst images according to the well-established Gardner scale. This approach is expected to yield a high accuracy, as the model simply mimics the Gardner scale to predict a known outcome. While this method may be useful for standardization of embryo classification according to the Gardner scale, it is not in fact predictive of pregnancy success, as it is based on a different endpoint. Of relevance to the current study, an AI model developed by Tran et al. (2019) was in fact intended to classify embryo quality based on clinical
Table I Results of pilot study demonstrate feasibility of creating an artificial intelligence-based image analysis model for prediction of human embryo viability.

                                              Validation   Blind Test   Blind Test   Total blind
                                              dataset      Set 1        Set 2        test dataset
Composition of dataset
  Total image no.                             390          368          632          1000
  No. of positive clinical pregnancies        70           76           194          270
  No. of negative clinical pregnancies        320          292          438          730
AI model accuracy
  Accuracy viable embryos (sensitivity)       74.3%        63.2%        78.4%        74.1%
  Accuracy non-viable embryos (specificity)   74.4%        77.1%        57.5%        65.3%
  Overall accuracy                            74.4%        74.2%        63.9%        67.7%
Comparison to embryologist grading – viable versus non-viable embryos
  No. of images with embryologist grade       ND           121          477          598
  AI model accuracy                           ND           71.9%        65.4%        66.7%
  Embryologist accuracy                       ND           47.1%        52.0%        51.0%
  AI model improvement                        ND           52.7%        25.8%        30.8%
  No. times AI model correct (a)              ND           42           106          148
  No. times embryologist correct (a)          ND           12           42           54
  AI model fold improvement                   ND           3.5x         2.5x         2.7x

(a) Images where the AI model was correct and the embryologist was incorrect, and vice versa.
AI = artificial intelligence; ND = not done; No. = number.
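The fold-improvement rows in the table count only discordant images, i.e. those where exactly one of the AI model and the embryologist made the correct call (footnote a). A minimal sketch of that ratio, with an illustrative function name (not from the study's code):

```python
def fold_improvement(n_ai_only_correct, n_emb_only_correct):
    """Ratio of discordant images where only the AI was correct to those
    where only the embryologist was correct."""
    return n_ai_only_correct / n_emb_only_correct

# Total blind test dataset column: 148 vs. 54 discordant images.
print(round(fold_improvement(148, 54), 1))  # 2.7
```

The same calculation reproduces the per-set values (42/12 = 3.5x, 106/42 = 2.5x).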

pregnancy outcome. This study did not report percentage accuracy of prediction, but instead reported a high level of accuracy for their model, IVY, using a receiver operating characteristic (ROC) curve. The AUC for IVY was 0.93 for true positive rate versus false positive rate; negative predictions were not evaluated. However, the datasets used for training and evaluation of this model were only partly based on actual ground-truth clinical pregnancy outcome—a large proportion of predicted non-viable embryos were never actually transferred, and were only assumed to lead to an unsuccessful pregnancy outcome. Thus, the reported performance is not entirely relevant in the context of clinical applicability, as the actual predictive power for the presence of a fetal heartbeat has not been evaluated to date.

The AI approach presented here is the first study of its kind to evaluate the true ability of AI to predict pregnancy outcome, by exclusively using ground-truth pregnancy outcome data for AI development and testing. It is important to note that while a pregnancy outcome endpoint is more clinically relevant and informative, it is inherently more complex in nature due to patient and laboratory variability that impacts pregnancy success rates beyond the quality of the embryo itself. The theoretical maximum accuracy for prediction of this endpoint based on evaluation of embryo quality is estimated to be ∼80%, with the remaining 20% affected by patient-related clinical factors, such as endometriosis, or laboratory process errors in embryo handling, that could lead to a negative pregnancy outcome despite a morphologically favorable embryo appearance (Annan et al., 2013). Given the influence of confounding variables, and the low average accuracy presently demonstrated by embryologists using traditional morphological grading methods (∼50% in the current study, Tables I and II), an AI model with an improvement of even 10–20% over that of embryologists would be considered highly relevant in this clinical domain. In the current study, we aimed to develop an AI model that demonstrated superiority to embryologists' predictive power for embryo viability, as determined by ground-truth clinical pregnancy outcome.

Materials and Methods

Experimental design

These studies were designed to analyze retrospectively collected data for development and testing of the AI-based model in prediction of embryo viability. Data were collected for female subjects who had undergone oocyte retrieval, IVF and embryo transfer. The investigation was non-interventional, and results were not used to influence treatment decisions in any way. Data were obtained for consecutive patients who had undergone IVF at 11 independent clinics in three countries (the USA, Australia and New Zealand) from 2011 to 2018. Data were limited to patients who received a single embryo transfer with a Day 5 embryo, and where the endpoint was clinical pregnancy outcome as measured by fetal heartbeat at first scan. The clinical pregnancy endpoint was deemed to be the most appropriate measure of embryo viability as this limited any confounding patient-related factors post-implantation. Criteria for inclusion/exclusion were established prospectively, and images not matching the criteria were excluded from analysis.

For inclusion in the study, images were required to be of embryos on Day 5 of culture taken using a standard optical light microscope-mounted camera. Images were only accepted if they were taken prior to PGS biopsy or freezing. All images were required to have a minimum
Table II Results of the pivotal study demonstrate generalizability of the Life Whisperer AI model for prediction of human embryo viability across multiple clinical environments.

                                              Validation   Blind Test   Blind Test   Blind Test   Combined
                                              dataset      Set 1        Set 2        Set 3        blind sets
Composition of dataset
  No. of images                               193          280          286          1101         1667
  No. of positive clinical pregnancies        97           141          180          334          655
  No. of negative clinical pregnancies        96           139          106          767          1012
AI model accuracy
  Accuracy viable embryos (sensitivity)       76.3%        72.3%        73.9%        67.1%        70.1%
  Accuracy non-viable embryos (specificity)   53.1%        54.7%        54.7%        62.3%        60.5%
  Overall accuracy                            64.8%        63.6%        66.8%        63.8%        64.3%
Comparison to embryologist grading – viable versus non-viable embryos
  No. of images with embryologist grade       ND           262          0 (a)        539          801
  AI model accuracy                           ND           63.7%        ND           57.0%        59.2%
  Embryologist accuracy                       ND           50.4%        ND           46.0%        47.4%
  AI model improvement                        ND           26.4%        ND           23.8%        24.7%
  No. times AI model correct (b)              ND           71           ND           101          172
  No. times embryologist correct (b)          ND           36           ND           42           78
  AI model fold improvement                   ND           2.0x         ND           2.4x         2.2x
Comparison to embryologist grading – embryo ranking
  AI model ranking correct (b)                ND           44.3%        ND           38.8%        40.6%
  Embryologist ranking correct (b)            ND           30.5%        ND           27.6%        28.6%
  AI model improvement                        ND           45.2%        ND           40.6%        42.0%

(a) Embryologist scores were not available.
(b) Images where the AI model was correct and the embryologist was incorrect, and vice versa.
ND = not done; No. = number.
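The overall accuracy rows in the table are the class-wise accuracies weighted by the number of embryos in each class. A minimal sketch of that combination, using the combined blind sets column (the function name is illustrative, not from the study's code):

```python
def overall_accuracy(sensitivity, specificity, n_viable, n_nonviable):
    """Weight per-class accuracy by the number of embryos in each class."""
    n_correct = sensitivity * n_viable + specificity * n_nonviable
    return n_correct / (n_viable + n_nonviable)

# Combined blind sets: 70.1% sensitivity over 655 viable embryos and
# 60.5% specificity over 1012 non-viable embryos.
acc = overall_accuracy(0.701, 0.605, 655, 1012)
print(f"{acc:.1%}")  # 64.3%
```

The same weighting reproduces the 67.7% combined figure from Table I (74.1% over 270 positives, 65.3% over 730 negatives).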

resolution of 512 x 512 pixels with the complete embryo in the field of view. Additionally, all images were required to have a matched clinical pregnancy outcome available (as detected by the presence of a fetal heartbeat on first ultrasound scan). For a subset of patients, the embryologist's morphokinetic grade was available, and was used to compare the accuracy of the AI with the standard visual grading method for those patients.

Ethics and compliance

All patient data used in these studies were retrospective and provided in a de-identified format. In the USA, the studies described were deemed exempt from Institutional Review Board (IRB) review pursuant to the terms of the United States Department of Health and Human Services' Policy for Protection of Human Research Subjects at 45 C.F.R. § 46.101(b) (IRB ID #6467, Sterling IRB). In Australia, the studies described were deemed exempt from Human Research Ethics Committee review pursuant to Section 5.1.2.2 of the National Statement on Ethical Conduct in Human Research 2007 (updated 2018), in accordance with the National Health and Medical Research Council Act 1992 (Monash IVF). In New Zealand, the studies described were deemed exempt from Health and Disability Ethics Committee review pursuant to Section 3 of the Standard Operating Procedures for Health and Disability Ethics Committees, Version 2.0 (August 2014).

These studies were not registered as a clinical trial as they did not meet the definition of an applicable clinical trial as defined by the ICMJE, that is, 'a clinical trial is any research project that prospectively assigns people or a group of people to an intervention, with or without concurrent comparison or control groups, to study the relationship between a health-related intervention and a health outcome'.

Viability scoring methods

For the AI model, an embryo viability score of 50% and above was considered viable, and below 50% non-viable. Embryologists' scores were provided for Day 5 blastocysts at the time when the image was taken. These scores were placed into scoring bands, which were roughly divided into 'likely viable' and 'likely non-viable' groups. This generalization allowed comparison of binary predictions from the embryologists with predictions from the AI model (viable/non-viable). The scoring system used by embryologists was based on the Gardner scale of morphokinetic grading (Gardner and Sakkas, 2003) for the quality of the ICM and the trophectoderm of the
embryo, indicated by a single letter (A–E). Also included was either a numerical score or a description of the embryo's stage of development toward hatching. Numbers were assigned in ascending order of embryo development as follows: 1 = start of cavitation, 2 = early blastocyst, 3 = full blastocyst, 4 = expanded blastocyst and 5 = hatching blastocyst. If no developmental stage was given, it was assumed that the embryo was at least an early blastocyst (>2). The conversion table for all clinics providing embryologists' scores is provided in Supplementary Table SI. For embryologists' scores, embryos of 3BB or higher grading were considered viable, and below 3BB considered non-viable.

Comparisons of embryo viability ranking were made by equating the embryologist's assessment with a numerical score from 1 to 5 and, similarly, dividing the AI model inferences into five equal bands labeled 1 to 5 (from the minimum inference to the maximum inference). If a given embryo image was given the same rank by the AI model and the embryologist, this was noted as a 'concordance'. If, however, the AI model provided a higher rank than the embryologist and the ground-truth outcome was recorded as viable, or the AI model provided a lower rank than the embryologist and the ground-truth outcome was recorded as non-viable, then this outcome was noted as 'model correct'. Similarly, if the AI model provided a lower rank than the embryologist and the ground-truth outcome was recorded as viable, or the AI model provided a higher rank and the outcome was recorded as non-viable, this outcome was noted as 'embryologist correct'.

Figure 1 Sample image of human embryo with pre-processing steps applied in order. The six main pre-processing steps, prior to transforming the image into a tensor format, are illustrated. (A) The input image is stripped of the alpha channel. (B) The image is padded to square dimensions. (C) The color balance and brightness levels are normalized. (D) The image is cropped to remove excess background space such that the embryo is centered. (E) The image is scaled in resolution for the appropriate neural network. (F–G) Segmentation is applied to the image as a pre-processing step for a portion of the neural networks. An image with the inner cell mass (ICM) and intra-zona cavity (IC) masked is shown in (F), and an image with the ICM/IC exposed is shown in (G). Images were taken at 200x magnification.
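The binary threshold and the five-band ranking comparison described in this section can be sketched as follows. This is a minimal illustration: the function names are invented, and the band edges are assumed here to span a fixed 0–1 score range, whereas the study bands inferences from the minimum to the maximum observed inference (the `lo`/`hi` parameters would be set accordingly).

```python
def is_viable(ai_score, threshold=0.5):
    """Binary call: an AI viability score of 50% and above is 'viable'."""
    return ai_score >= threshold

def ai_band(ai_score, n_bands=5, lo=0.0, hi=1.0):
    """Place an AI inference into one of five equal bands, labeled 1-5."""
    width = (hi - lo) / n_bands
    band = int((ai_score - lo) / width) + 1
    return min(max(band, 1), n_bands)  # clamp the ai_score == hi edge case

def ranking_outcome(ai_rank, embryologist_rank, viable):
    """Classify one image per the concordance / 'model correct' /
    'embryologist correct' rules, given the ground-truth outcome."""
    if ai_rank == embryologist_rank:
        return 'concordance'
    ai_higher = ai_rank > embryologist_rank
    if (ai_higher and viable) or (not ai_higher and not viable):
        return 'model correct'
    return 'embryologist correct'

print(ai_band(0.62))                       # 4
print(ranking_outcome(4, 2, viable=True))  # model correct
```

Note that a tie in rank is recorded as concordance regardless of outcome, so only discordant rankings contribute to the 'correct' tallies compared in Table II.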
Computer vision image processing methods

All image data underwent a pre-processing stage, as outlined below. These computer vision image processing methods were used in model development, and incorporated into the final AI model.

– Each image was stripped of its alpha channel to ensure that it was encoded in a 3-channel format (e.g. RGB). This step removed additional information from the image relating to transparency maps, while incurring no visual change to the image. These portions of the image were not used.
– Each image was padded to square dimensions, with each side equal to the longest side of the original image. This process ensured that image dimensions were consistent, comparable and compatible for deep learning methods, which explicitly require square dimension images as input, while also ensuring that no key components of the image were cropped.
– Each image was RGB color normalized by taking the mean of each RGB channel and dividing each channel by its mean value. Each channel was then multiplied by a fixed value of 100/255, in order to ensure the mean value of each image in RGB space was (100, 100, 100). This step ensured that color biases among the images were suppressed, and that the brightness of each image was normalized.
– Each image was then cropped so that the center of the embryo was in the center of the image. This was carried out by extracting the best ellipse fit from an elliptical Hough transform, calculated on the binary threshold map of the image. This method acts by selecting the hard boundary of the embryo in the image, and by cropping the square boundary of the new image so that the longest radius of the new ellipse is encompassed by the new image width and height, and so that the center of the ellipse is the center of the new image.
– Each image was then scaled to a smaller resolution prior to training.
– For training of selected models, images underwent an additional pre-processing step called boundary-based segmentation. This process acts by separating the region of interest (i.e. the embryo) from the image background, and allows masking in order to concentrate the model on classifying the gross morphological shape of the embryo.
– Finally, each image was transformed to a tensor rather than a visually displayable image, as this is the required data format for deep learning models. Tensor normalization used the standard pre-trained ImageNet values: mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225). Figure 1 shows an example embryo image carried through the first six pre-processing steps described above.

Embryo images obtained for this study were divided into training, validation and blind dataset categories by randomizing available data, with the constraint that each dataset was to have an even distribution of examples across each of the classifications (i.e. the same ratio of viable to non-viable embryos). For model training, the images in the training dataset were additionally manipulated using a set of augmentations. Augmentations are required for training in order to anticipate changes to lighting conditions, rotation of the embryo and focal length, so that the final model is robust to these conditions in new unseen datasets. The augmentations used in model training are as follows:

– Rotations: Images were rotated a number of ways, including 90 degree rotations, and also other non-90 degree rotations
Artificial intelligence for embryo viability assessment 775

.
where the diagonal whitespace in the square image due to these rotations was filled in with background color using the OpenCV (version 3.2.0; Willow Garage, Itseez, Inc. and Intel Corporation; 2200 Mission College Blvd, Santa Clara, CA 95052, USA) 'Border_Replicate' method, which uses the pixel values near the image border to fill in whitespace after rotation.
– Reflections: Horizontal or vertical reflections of the image were also included as training augmentations.
– Gaussian blur: Gaussian blurring was applied to some images using a fixed kernel size (with a default value of 15).
– Contrast variation: Contrast variation was introduced to images by modifying the standard deviation of the pixel variation of the image from the mean, away from its default value.
– Random horizontal and vertical translations (jitter): Small horizontal and vertical translations, randomly applied such that the blastocyst did not deviate outside the field of view, were used to assist the model in training invariance to translation or position in the image.
– Random compression or jpeg noise: While uncompressed file formats are preferred for analysis (e.g. 'png' format), many embryo images are provided in the common compressed 'jpeg' format. To control for compression artifacts from images of jpeg format, jpeg compression noise was randomly applied to some images during training.

Model architectures considered

A range of deep learning and computer vision/machine learning methods was evaluated in training the AI model, as follows. The most effective deep learning architectures for classifying embryo viability were found to be residual networks, such as ResNet-18, ResNet-50 and ResNet-101 (He et al., 2016), and densely connected networks, such as DenseNet-121 and DenseNet-161 (Huang et al., 2017). These architectures were more robust than other types of models when assessed individually. Other deep learning architectures, including InceptionV4 and Inception-ResNetV2 (Szegedy et al., 2016), were also tested but excluded from the final AI model due to poorer individual performance. Computer vision/machine learning models, including support vector machines (Hearst, 1998) and random forests (Breiman, 2001) with computer vision feature computation and extraction, were also evaluated. However, these methods yielded limited translatability and poorer accuracy compared with deep learning methods when evaluated individually, and were therefore excluded from the final AI model ensemble. For more information see the Model selection process section.

Loss functions considered

The following quantities were evaluated to select the best model types and architectures:

– Model stabilization: how stable the accuracy value was on the validation set over the training process.
– Model transferability: how well the accuracy on the training data correlated with the accuracy on the validation set.
– Prediction accuracy: which models provided the best validation accuracy, for both viable and non-viable embryos, the total combined accuracy and the balanced accuracy, defined as the weighted average accuracy across both classes of embryos.

In all cases, use of ImageNet pretrained weights demonstrated improved performance on these quantities.

Loss functions that were evaluated as options for the model's hyper-parameters included cross entropy (CE), weighted CE and residual CE loss functions. The accuracy on the validation set was used as the selection criterion to determine a loss function. Following evaluation, only weighted CE and residual CE loss functions were chosen for use in the final model, as these demonstrated improved performance. For more information see the Model selection process section.

Deep learning optimization specifications

Multiple models with a wide range of parameter and hyper-parameter settings were trained and evaluated. Optimization protocols that were tested, which specify how the value of the learning rate is used during training, included stochastic gradient descent (SGD) with momentum (and/or Nesterov accelerated gradients), adaptive gradient with delta (Adadelta), adaptive moment estimation (Adam), root-mean-square propagation (RMSProp) and limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS). Of these, two well-known training optimization strategies (optimizers) were selected for use in the final model: SGD (Rumelhart et al., 1986) and Adam (Kingma and Ba, 2014). Optimizers were selected for their ability to drive the update mechanism for the network's weight parameters to minimize the objective/loss function.

Learning rates were evaluated within the range of 1e-5 to 1e-1. Testing of learning rates was conducted with the use of a step scheduler, which reduces the learning rate as training progresses. Learning rates were selected based on their ability to stably converge the model toward a minimum loss function. The dropout rate, an important technique for preventing over-training of deep learning models, was tested within the range of 0.1 to 0.4. This involved probabilistically dropping out nodes in network layers with a large number of weight parameters to prevent over-fitting during training. For more information see the Model selection process section.

Each deep neural network used weight parameters obtained from pre-training on ImageNet, with the final classifier layer replaced with a binary classifier corresponding to non-viable and viable classification. Training of AI models was conducted using the PyTorch library (version 0.4.0; Adam Paszke, Sam Gross, Soumith Chintala and Gregory Chanan; 1601 Willow Rd, Menlo Park, CA 94025, USA), with CUDA support (version 9; Nvidia Corporation; 2788 San Tomas Expy, Santa Clara, CA 95051, USA) and OpenCV (version 3.2.0; Willow Garage, Itseez, Inc. and Intel Corporation; 2200 Mission College Blvd, Santa Clara, CA 95052, USA).

Individual models were trained and evaluated separately using a train-validate cycle process as follows:

– Batches of images were randomly sampled from the training dataset and a viability outcome predicted for each embryo image.
– Results for each image were compared to known outcomes to compute the difference between the prediction and the actual outcome (loss).
– The loss value was then used to adjust the model's weights to improve its prediction (backpropagation), and the running total accuracy was assessed.
– This process was repeated thousands of times until the loss was reduced as much as possible and the value plateaued.
– When all batches in the training dataset had been assessed (i.e. 1 epoch), so that the entire training set had been covered, the training set was re-randomized and training was repeated.
– After each epoch, the model was run on a fixed subset of images reserved for informing the training process to prevent over-training (the validation set).
– The train-validate cycle was carried out for 2–100 epochs, until a sufficiently stable model with a low loss function was developed. At the conclusion of the series of train-validate cycles, the highest performing models were combined into a final ensemble model as described below.

Model selection process

Evaluation of individual model performance was accomplished using a model architecture selection process. Only the training and validation sets were used for evaluation. Each type of prediction model was trained with various settings of model parameters and hyper-parameters, including input image resolution, choice of optimizer, learning rate value and scheduling, momentum value, dropout and initialization of the weights (pre-training).

After shortlisting model types and loss functions using the criteria established in the preceding sections, models were separated into two groups: first, those that included additional image segmentation, and second, those that required the entire unsegmented image. Models that were trained on images that masked the ICM, exposing the zona region, were denoted zona models. Models that were trained on images that masked the zona (denoted ICM models), and models that were trained on full-embryo images, were also considered in training. A group of models encompassing contrasting architectures and pre-processing methods was selected in order to maximize performance on the validation set. Individual model selection relied on two criteria, namely the diversity and contrasting criteria, for the following reasons:

– The diversity criterion drives model selection to include different model hyper-parameters and configurations. The reason is that, in practice, similar model settings result in similar prediction outcomes and hence may not be useful for the final ensemble model.
– The contrasting criterion drives selection of models with diverse prediction outcome distributions, due to different input images or segmentation. This approach was supported by evaluating performance accuracies across individual clinics, and ensured translatability by avoiding selection of models that performed well only on specific clinic datasets, thus preventing over-fitting.

The final prediction model was an ensemble of the highest performing individual models (Rokach, 2010). Well-performing individual models that exhibited different methodologies, or extracted different biases from the features obtained through machine learning, were combined using a range of voting strategies based on the confidence of each model. Voting strategies evaluated included mean, median, max and majority mean voting. It was found that the majority mean voting strategy outperformed the other voting strategies for this particular ensemble model, giving the most stable model across all datasets, and it was therefore chosen as the preferred strategy.

The final ensemble model includes eight deep learning models, of which four are zona models and four are full-embryo models. The final model configuration used in this study is as follows:

– One full-embryo ResNet-152 model, trained using SGD with momentum = 0.9, CE loss, learning rate 5.0e-5, a step-wise scheduler halving the learning rate every 3 epochs, a batch size of 32, an input resolution of 224 x 224 and a dropout value of 0.1.
– One zona ResNet-152 model, trained using SGD with momentum = 0.99, CE loss, learning rate 1.0e-5, a step-wise scheduler dividing the learning rate by 10 every 3 epochs, a batch size of 8, an input resolution of 299 x 299 and a dropout value of 0.1.
– Three zona ResNet-152 models, trained using SGD with momentum = 0.99, CE loss, learning rate 1.0e-5, a step-wise scheduler dividing the learning rate by 10 every 6 epochs, a batch size of 8, an input resolution of 299 x 299 and a dropout value of 0.1, one of which was trained with random rotation of any angle.
– One full-embryo DenseNet-161 model, trained using SGD with momentum = 0.9, CE loss, learning rate 1.0e-4, a step-wise scheduler halving the learning rate every 5 epochs, a batch size of 32, an input resolution of 224 x 224 and a dropout value of 0, trained with random rotation of any angle.
– One full-embryo DenseNet-161 model, trained using SGD with momentum = 0.9, CE loss, learning rate 1.0e-4, a step-wise scheduler halving the learning rate every 5 epochs, a batch size of 32, an input resolution of 299 x 299 and a dropout value of 0.
– One full-embryo DenseNet-161 model, trained using SGD with momentum = 0.9, residual CE loss, learning rate 1.0e-4, a step-wise scheduler halving the learning rate every 5 epochs, a batch size of 32, an input resolution of 299 x 299 and a dropout value of 0, trained with random rotation of any angle.

The architecture diagram corresponding to ResNet-152, which features heavily in the final model configuration, is shown in Figure 2. A flow chart describing the entire model creation and selection methodology is shown in Figure 3. The final ensemble model was subsequently validated and tested on blind test datasets as described in the Results section.

Statistical analysis

Measures of accuracy used in the assessment of model behavior included sensitivity, specificity, overall accuracy, distributions of predictions and comparison to embryologists' scoring methods. For the AI model, an embryo viability score of 50% and above was considered viable, and below 50% non-viable. Accuracy in identification of viable embryos (sensitivity) was defined as the number of embryos that the AI model identified as viable divided by the total number of known viable embryos that resulted in a positive clinical pregnancy. Accuracy in identification of non-viable embryos (specificity) was defined as the number of embryos that the AI model identified as non-viable divided by the total number of known non-viable embryos that resulted in a negative clinical pregnancy outcome. Overall accuracy of the AI model was determined using a weighted average of sensitivity and specificity, and the percentage improvement in accuracy of the AI model over the embryologist was defined as the difference in accuracy as a
Figure 2 Example illustration of ResNet-152 neural network layers. The layer diagram from input to prediction for a neural network of type ResNet-152, which features prominently in the final Life Whisperer artificial intelligence (AI) model, is shown. For the 152 layers, the number of convolutional layers ('conv') is depicted, along with the filter size, which is the receptive region taken by each convolutional layer. Two-dimensional maxpooling layers ('pool') are also shown, with a final fully connected (FC) layer, which represents the classifier, with a binary output for prediction (non-viable and viable).

proportion of the original embryologist accuracy (i.e. (AI_accuracy - embryologist_accuracy)/embryologist_accuracy).

For these analyses, embryologist scores corresponding to blastocyst assessment at Day 5 were provided; that is, their assessment was provided at the same point in time as when the image was taken. This ensured that the time point for model assessment and the embryologist's assessment were consistent. Note that the subset of data that includes corresponding embryologist scores was sourced from a range of clinics, and thus the measured embryologist grading accuracy varied across each clinic, from 43.9% to 55.3%. This is due to variation in embryologist skill, and statistical fluctuation of embryologist scoring methods across the dataset. In order to provide a comparison that ensured the most representative distribution of embryologist skill levels, all embryologist scores were considered across all clinics and combined in an unweighted manner, instead of considering accuracies from individual clinics. This approach therefore captured the inherent diversity in embryologist scoring efficacy.

The distributions of prediction scores for both viable and non-viable embryo images were used to determine the ability of the AI model to separate true positives from false negatives, and true negatives from false positives. AI model predictions were normalized between 0 and 1, and interpreted as confidence scores. Distributions were presented as histograms based on the frequency of confidence scores. Bi-modal distributions of predictions indicated that true positives and false negatives, or true negatives and false positives, were separated with a degree of confidence, meaning that the predictive power of the model on a given dataset was less likely to have been obtained by chance. Alternatively, slight asymmetry in a unimodal Gaussian-like distribution falling on either side of a threshold indicated that the model was not easily able to separate distinct classes of embryo.

A binary confusion matrix containing class accuracy measures, i.e. sensitivity and specificity, was also used in model assessment. The confusion matrix evaluated model classification and misclassification based on true positives, false negatives, false positives and true negatives. These numbers were depicted visually using tables or ROC plots where applicable.

Final model accuracy was determined using results from blind datasets only, as these consisted of completely independent 'unseen' datasets that were not used in model training or validation. In general, the accuracy on any validation dataset will be higher than that on a blind dataset, as the validation dataset is used to guide training and selection of the AI model. For a true, unbiased measure of accuracy, only blind datasets were used. The number of replicates used for determination of accuracy was defined as the number of completely independent blind test sets comprising images that were not used in training the AI model. Double-blind test sets, consisting of images provided by clinics that did not provide any data for model training, were used to evaluate whether the model had been over-trained on data provided by the original clinics.

Results

Datasets used in model development

Model development was divided into two distinct studies. The first study consisted of a single-site pilot study to determine the feasibility of creating an AI model for prediction of embryo viability, and to refine the training techniques and principles to be adopted for a second multi-site study. The first study, or pilot study, was performed using a total of 5282 images provided by a single clinic in Australia, with 3892 images used for the training process. The AI model techniques explored in
Figure 3 Flow chart for model creation and selection methodology. The model creation methodology is depicted beginning from data collection (top). Each step summarizes the component tasks that were used in the development of the final AI model. After image processing and segmentation, the images were split into datasets and the training dataset was prepared by image augmentation. The highest performing individual models were considered candidates for inclusion in the final ensemble model, and the final ensemble model was selected using a majority mean voting strategy.
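The majority mean voting step at the bottom of Figure 3 can be sketched as follows. The paper does not give the exact formula, so this assumes a common form in which the majority vote fixes the predicted class and the ensemble confidence is the mean score of the models that agree with the majority; the function name is hypothetical.

```python
from statistics import mean

def majority_mean_vote(scores, threshold=0.5):
    """Combine per-model confidence scores (0-1) into one prediction.

    Assumed form of 'majority mean' voting: the class is decided by
    majority vote, and the reported confidence is the mean score of the
    models that voted with the majority (ties resolve to non-viable).
    """
    votes = [s >= threshold for s in scores]
    is_viable = sum(votes) * 2 > len(votes)
    agreeing = [s for s, v in zip(scores, votes) if v == is_viable]
    return is_viable, mean(agreeing)
```

For example, with four model scores [0.9, 0.8, 0.7, 0.2], three of the four models vote viable, so the ensemble returns viable with a confidence equal to the mean of the three agreeing scores.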

the pilot study were then further developed in a second, pivotal study to determine generalizability to different clinical environments. The pivotal study used a total of 3604 images provided by 11 clinics from across the USA, Australia and New Zealand, with a total of 1744 images of Day 5 embryos used for training the Life Whisperer AI model presented in this article.

Images were split into defined datasets for model development in each study, which included a training dataset, a validation dataset and multiple blind test sets. Figure 4 depicts the number and origin of images that were used in each dataset in both the pilot and pivotal studies. A significant proportion of images in each study was used in model training, with a total of 3892 images used in the pilot study and a further 1744 images used in the pivotal study. AI models were selected and validated using validation datasets, which contained 390 images in the pilot study and 193 in the pivotal study. Accuracy was determined using blind datasets only, comprising a total of 1000 images in the pilot study and 1667 images in the pivotal study. Two independent blind test sets were evaluated in the pilot study; these were both provided by the same clinic that provided images for training. Three independent blind test sets were evaluated in the pivotal study. Blind Test Set 1 comprised images from the same clinics that provided images for training. Blind Test Sets 2 and 3 were, however, provided by completely independent clinics that did not provide any data for training. Thus, Blind Test Sets 2 and 3 represented double-blinded datasets with respect to the AI computational methods.

In total, 52.5% of all images in the blind datasets had embryologist grades available for comparison of outcome. Note that embryologist's grades were not available for Blind Test Set 2 in the pivotal study; therefore, n = 2 for both studies in the comparison of AI model accuracy to that of embryologists.

Pilot feasibility study

Table I shows a summary of results for the pilot study, presented according to dataset (validation dataset, individual blind test sets and combined blind test dataset). In this study, negative pregnancies were found to outweigh positive pregnancies by approximately 3-fold. Sensitivity of the Life Whisperer AI model for viable embryos was 74.1%, and specificity for non-viable embryos was 65.3%. The greater sensitivity compared to specificity was to be expected, as it reflects the
Figure 4 Image datasets used in AI model development and testing. A total of 8886 images of Day 5 embryos with matched clinical pregnancy outcome data were obtained from 11 independent IVF clinics across the USA, Australia and New Zealand. The pilot (feasibility) study to develop the initial AI model utilized 5282 images from a single clinic in Australia. This model was further developed in the pivotal study, which utilized an additional 3604 images from all 11 clinics. Blind test sets were used to determine AI model accuracy.
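The dataset partitioning summarized in Figure 4 (training and validation sets guiding model development, with blind and double-blind sets held out for accuracy estimates) can be sketched as follows. This is an illustrative sketch rather than the authors' protocol; the function name and fixed seed are hypothetical.

```python
import random

def partition_images(image_ids, n_validation, seed=0):
    """Shuffle once with a fixed seed, reserve a validation subset to
    guide training and model selection, and use the remainder for
    training. Blind test sets are not drawn from this pool: they are
    held out entirely and, for double-blind sets, come from clinics
    that contributed no training data at all."""
    rng = random.Random(seed)        # reproducible split
    ids = list(image_ids)
    rng.shuffle(ids)
    return ids[n_validation:], ids[:n_validation]  # (training, validation)
```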

intended bias to grade embryos as viable that was introduced during model development. Overall accuracy for the AI model was 67.7%. For the subset of images that had embryologist's scores available, the AI model provided an average accuracy improvement of 30.8% over embryologist's grading for viable/non-viable predictions (P = 0.021, n = 2, Student's t test). The AI model correctly predicted viability over the embryologist 148 times, whereas the embryologist correctly predicted viability over the model 54 times, representing a 2.7-fold improvement for the AI model.

The AI model developed in this pilot study was used as a basis for further development in the pivotal study described below.

Model accuracy and generalizability

The results of the pivotal study are presented in Table II. In this study, the distribution of negative and positive pregnancies was more even than in the pilot study, with negative pregnancies occurring ∼50% more often than positive pregnancies (a 1.5-fold increase, compared to the 3-fold increase in the pilot study).

After further development using data from a range of clinics, the Life Whisperer AI model showed a sensitivity of 70.1% for viable embryos and a specificity of 60.5% for non-viable embryos. This was relatively similar to the initial accuracy values obtained in the pilot study, although values were marginally lower; this was not unexpected, due to the introduction of inter-clinic variation into the AI development. Note that while the sensitivity in Blind Test Set 3 was ∼5–7% lower than that of Blind Test Sets 1 and 2, the specificity was ∼8% higher than in those datasets, making the overall accuracy comparable across all three blind test sets. The overall accuracy in each blind test set was >63%, with a combined overall accuracy of 64.3%.

Binary comparison of viable/non-viable embryo classification demonstrated that the AI model provided an average accuracy improvement of 24.7% over embryologist's grading (P = 0.047, n = 2, Student's t test). The AI model correctly predicted viability over the embryologist 172 times, whereas the embryologist correctly predicted viability over the model 78 times, representing a 2.2-fold improvement for the AI model. Comparison to embryologist's scores using the 5-band ranking system approach showed that the AI model was correct over embryologists for 40.6% of images, and incorrect compared to embryologist's scoring for 28.6% of images, representing an improvement of 42.0% (P = 0.028, n = 2, Student's t test).

Confusion matrices showing the total number of true positives, false positives, false negatives and true negatives obtained from embryologist grading methods and the AI model are shown in Figure 5. By
comparing the embryologist and AI model results, it is clear that the embryologist accuracy overall is significantly lower, even though the sensitivity is higher. This is most likely due to the fact that embryologist scores are typically high for the sub-class of embryos that have been implanted, and therefore there is a natural bias in the dataset toward embryos that have a high embryologist score. While alteration of the embryologist threshold score of '3BB', above which embryos are considered 'likely viable', does not result in greater embryologist accuracy, a significant proportion of the embryos considered, (114 + 121)/(134 + 128) = 89.7%, were graded equal to or higher than 3BB. While the AI model showed a reduction in true positives compared to the embryologist, there was a significant improvement in specificity. The AI model demonstrated values in excess of 60% for both sensitivity and specificity, while still retaining a bias toward high sensitivity in order to minimize the number of false negatives.

Figure 5 Confusion matrix of the pivotal study for embryologist and AI model grading. True positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) are shown. The embryologists' confusion matrix is depicted on the top panel, and the AI model's confusion matrix is depicted on the bottom panel. The embryologists' overall accuracy is significantly lower, despite a relatively higher sensitivity, due to the enhanced specificity of the AI model's predictions. Clin. Preg. = clinical pregnancy; No Clin. Preg. = no clinical pregnancy.

A visual representation of the distribution of rankings from embryologists and from the AI model is shown in Figure 6. The histograms differ from each other in the shape of their distribution. There is a clear dominance in the embryologist's scores around a rank value of 3, dropping off steeply for lower scores of 1 and 2, which reflects the tendency of embryologists to grade in the average to above average range. By comparison, the AI model demonstrated a smaller peak for rank values of 3, and larger peaks for rank values of 2 and 4. This reflects the AI model's ability to distinctly separate predictions of viable and non-viable embryos, suggesting that the model provides a more granular scoring range across the different quality bands.

As a final measure of AI model performance, the distributions of prediction scores for viable and non-viable embryos were graphed to evaluate the ability of the model to separate correctly from incorrectly identified embryo images. The histograms for the distributions of prediction scores are presented in Figure 7. The shapes of the histograms demonstrate clear separation between correctly and incorrectly identified viable or non-viable embryo images.

Discussion

In these studies, Life Whisperer used ensemble modeling to combine computer vision methods and deep learning neural network techniques to develop a robust image analysis model for the prediction of human embryo viability. In the initial pilot study, an AI-based model was created that not only matched the prediction accuracy of trained embryologists, but in fact surpassed the original objective by demonstrating an accuracy improvement of 30.8%. The AI-based model was further developed in a pivotal study to extend generalizability and transferability to multiple clinical environments in different geographical locations. Model accuracy was marginally lower on further development due to the introduction of inter-clinic variability, which may have affected efficacy due to varying patient demographics (age, health, ethnicity, etc.) and divergent standard operating procedures (equipment and methods used for embryo culture and image capture). Variation was also likely introduced by embryologists being trained differently in embryo-scoring methods. However, the final AI model is both robust and accurate, demonstrating a significant improvement of 24.7% over the predictive accuracy of embryologists for binary viable/non-viable classification, despite the variability of the clinical environments tested. The overall accuracy for prediction of embryo viability was 64.3%, which was considered relatively high given that research studies suggest a theoretical maximum accuracy of 80%, with ∼20% of IVF cases thought to fail due to factors unrelated to embryo viability (e.g. operational errors, patient-related health factors, etc.).

Confusion matrices and comparison of the distribution of viability rankings highlighted the tendency for embryologists to classify embryos as viable, as it is generally considered preferable to allow a non-viable embryo to be transferred than to allow a viable embryo to be discarded. During development, the AI model was intentionally biased to similarly minimize the occurrence of false negatives; this was reflected in the slightly higher accuracy for viable embryos than for non-viable embryos (70.1% and 60.5% for sensitivity and specificity, respectively). By examining the distribution of viability rankings for the AI model on the validation set, it was demonstrated that the model was able to distinctly separate predictions of viable and non-viable embryos on the blind test sets. Furthermore, graphical representation of the distribution of predictions for both viable and non-viable embryos demonstrated a clear separation of correct and incorrect predictions (for both viable and non-viable embryos, separately) by the AI model.

Machine learning methods have recently come into the spotlight for various medical imaging diagnostic applications. In particular, several groups have published research describing the use of either conventional machine learning or AI image analysis techniques to automate embryo classification. Two recent studies described conventional algorithms for prediction of blastocyst formation rather than clinical
Figure 6 Distribution of viability rankings demonstrates the ability of the AI model to distinctly separate viable from non-viable human embryos. The left panel depicts the frequency of embryo viability rankings according to embryologist's scores, and the right panel depicts the frequency of viability rankings according to AI model predictions. Results are shown for Blind Test Set 1. Y-axis = % of images in rank; x-axis = ranking band (1 = lowest predicted viability, 5 = highest predicted viability).
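The 50% binary threshold defined in the Statistical analysis section, and a mapping from normalized confidence scores to the five ranking bands plotted in Figure 6, can be sketched as follows. Uniform band edges are an assumption made for illustration; the paper does not state the exact boundaries used.

```python
def viability_call(score):
    """Binary call used in the study: a normalized viability score of
    0.5 and above is classed viable, below 0.5 non-viable."""
    return "viable" if score >= 0.5 else "non-viable"

def rank_band(score):
    """Map a normalized score (0-1) to one of five ranking bands
    (1 = lowest predicted viability, 5 = highest). Uniform band edges
    are an assumption made for illustration."""
    return min(int(score * 5) + 1, 5)
```

Binning blind-test scores with `rank_band` and plotting the per-band frequencies reproduces the kind of histogram shown in Figure 6.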

pregnancy. These studies achieved 76.4% (Segal et al., 2018) and >93% (Wong et al., 2010) accuracy, respectively, in their overall classification objectives, which included prediction of blastocyst formation based on Day 2 and/or Day 3 morphology and a number of other independent data points. However, it is important to note that blastocyst formation is not a reliable indicator of the probability of clinical pregnancy, and therefore the utility of this approach for prediction of pregnancy outcome is limited.

As discussed earlier, three recent studies described development of AI-based systems for classification of embryo quality (Khosravi et al., 2019; Kragh et al., 2019; Tran et al., 2019). All three studies utilized images taken by time-lapse imaging systems, which likely standardized the quality of images provided for analysis compared to those obtained by standard optical light microscopy. Khosravi et al. (2019) reported an accuracy of 97.5% for their model STORK in predicting embryo grade. Their model was not, however, developed to predict clinical pregnancy outcome. The high reported accuracy in this case may be attributed to the fact that the analysis was limited to classification of poor versus good quality embryos; fair quality embryos in between were excluded from analysis. Similarly, Kragh et al. (2019) reported accuracies of 71.9% and 76.4% for their model in grading embryonic ICM and trophectoderm, respectively, according to standard morphological grading methods. This was shown to be at least as good as the performance of trained embryologists. The authors also evaluated predictive accuracy for implantation, for which data were available for a small cohort of images. The AUC for prediction of implantation was not significantly different from that of embryologists (AUC of 0.66 and 0.64, respectively), and therefore this model has limited ability for prediction of pregnancy outcome.

The approach taken by Tran et al. (2019) for development of their AI model IVY used deep learning to analyze time-lapse embryo images to predict pregnancy success rates. This study used 10 683 embryo images from 1648 individual patients throughout the course of the training and development of IVY, with 8836 embryos coded as positive or negative cases. Of note, although developed to predict pregnancy outcome, the IVY AI was trained on a heavily biased dataset of only 694 cases (8%) of positive pregnancy outcomes, with 8142 negative outcome cases (92%). Additionally, 87% (7063 cases) of the negative outcome cases were from embryos that were never transferred to a patient, discarded based on abnormal morphology considerations or aneuploidy, and therefore the ground-truth clinical pregnancy outcome cannot be known. The approach used to train the IVY AI only used ground-truth pregnancy outcome for a very small proportion of the algorithm training and thus has a heavy inherent bias toward the embryologist assessment for negative outcome cases. Although somewhat predictive of pregnancy outcome, the accuracy of the AI has not truly been measured on ground-truth outcomes of clinical pregnancy; it therefore gives a false representation of the true predictive accuracy, which can only be assessed on an AI model trained exclusively on known fetal heartbeat outcome data.

The works discussed above are not experimentally comparable with the current study, as they generally relate to different endpoints; for example, the prediction of blastocyst formation at Day 5 starting from an image at Day 2 or Day 3 post-IVF. While there is some benefit in these methods, they do not provide any power in predicting clinical pregnancy, in contrast to the present study evaluating the Life Whisperer model. Other studies have shown that a high level of accuracy can be achieved through the use of AI in replicating embryologist scoring methods (Khosravi et al., 2019; Kragh et al., 2019); however, the work presented here has shown that the accuracy of embryologist grading methods in predicting clinical pregnancy rates is in actuality fairly low. An AI model trained to replicate traditional grading methods to a high degree of accuracy may be useful for automation and standardization of grading, but it can, at best, only be as accurate as the grading method itself. In the current study, only ground-truth outcomes for fetal heartbeat at first scan were used in the training, validation and testing of the model. Given the nature of predicting implantation based on embryo morphology, which will necessarily be confounded by patient factors beyond the scope of morphological assessment, it would be expected
782 VerMilyea et al.

Figure 7 Distributions of prediction scores show the separation of correct from incorrect predictions by the AI model. Distributions of prediction scores are presented for Blind Test Set 1 (A), Blind Test Set 2 (B) and Blind Test Set 3 (C). The left panel in each set depicts the frequency of predictions presented as confidence intervals for viable embryos. True positives, where the model was correct, are marked in blue, and false negatives, where the model was incorrect, are marked in red. The right panel in each set depicts the frequency of predictions presented as confidence intervals for non-viable embryos. True negatives, where the model was correct, are marked in green, and false positives, where the model was incorrect, are marked in orange.

that the overall accuracy of the Life Whisperer AI model would be lower than for alternative endpoints but more clinically relevant. For the first time, this study presents a realistic measurement of AI accuracy for embryo assessment and a true representation of predictive ability for the pregnancy outcome endpoint. Given the relatively low accuracy of embryologists in predicting viability, as shown in this study (∼50%), and a theoretical maximum accuracy of 80%, the Life Whisperer AI model's accuracy of ∼65% represents a significant and clinically relevant improvement for predicting embryo viability in this domain.

The present study demonstrated that the Life Whisperer AI model provided suitably high sensitivity, specificity and overall accuracy levels
for prediction of embryo viability based directly on ground-truth clinical pregnancy outcome, as indicated by positive fetal cardiac activity on ultrasound. The model was able to predict embryo viability by analysis of images obtained using standard optical light microscope systems, which are utilized by the majority of IVF laboratories and clinics worldwide. AUC/ROC was not used as a primary methodology for evaluation of accuracy due to inherent limitations of the approach when applied to largely unbalanced datasets, such as those used in development of IVY (which used a dataset with a ∼13:1 ratio of negative to positive clinical pregnancies) (Tran et al., 2019). Nevertheless, the ROC curve for the Life Whisperer AI model is presented for completeness in Supplementary Figure S1, with results demonstrating an improved AUC for the AI model when compared to embryologists' scores.

The unique power of the Life Whisperer AI model developed here lies in the use of ensemble modeling to combine computer vision image processing methods and multiple deep learning AI techniques to identify morphological features of viability that are not readily discernible to the human eye. The Life Whisperer AI model was trained on images of Day 5 blastocysts at all stages, including early, expanded, hatching and hatched blastocysts, and as such it can be used to analyze all stages of blastocyst development. One potential limitation of the AI model as it currently stands is that it does not incorporate additional information from different days of embryo development. Emerging data using time-lapse imaging systems suggest that certain aspects of developmental kinetics in culture may correlate with embryo quality (Gardner et al., 2015). Therefore, it would be of interest to evaluate or modify the ability of the Life Whisperer AI model to extend to additional time points during embryo development. It would also be of interest to evaluate alternative pregnancy endpoints, such as live birth outcome, as fetal heartbeat is not an absolute indicator of live birth. However, it is important to note that the endpoint of live birth is additionally affected by patient-related confounding factors. The current investigation was performed with retrospectively collected data, and hence it will be important to collect data prospectively to assess real-world use of the AI model. Additional data collection and analysis are expected to further improve the accuracy of the AI.

The AI model developed here has been incorporated into a cloud-based software application that is globally accessible via the web. The Life Whisperer software application allows embryologists or similarly qualified personnel to upload images of embryos using any computer or mobile device, and the AI model will instantly return a viability confidence score. The benefits of this approach lie in its simplicity and ease of use: the Life Whisperer system does not require installation of complex or expensive equipment, nor any specific computational or analytical knowledge. Additionally, the use of this tool will not require any substantial change in standard operating procedures for IVF laboratories; embryo images are routinely taken as part of IVF laboratory standard procedures, and analysis can be performed at the time of image capture from within the laboratory to help decide which embryos to transfer, freeze or discard. The studies described herein support the use of the Life Whisperer AI model as a clinical decision support tool for prediction of embryo viability during IVF procedures.

Supplementary data

Supplementary data are available at Human Reproduction online.

Acknowledgements

The authors acknowledge the kind support of investigators and collaborating clinics for providing embryo images and associated data as follows: Hamish Hamilton and Michelle Lane, Monash IVF Group/Repromed (Adelaide SA, Australia); Matthew 'Tex' VerMilyea and Andrew Miller, Ovation Fertility (Austin TX and San Antonio TX, USA); Bradford Bopp, Midwest Fertility Specialists (Carmel IN, USA); Erica Behnke, Institute for Reproductive Health (Cincinnati OH, USA); Dean Morbeck, Fertility Associates (Auckland, Christchurch, Dunedin, Hamilton and Wellington, New Zealand); and Rebecca Matthews, Oregon Reproductive Medicine (Portland OR, USA).

Authors' roles

M.V., J.M.M.H., D.P. and M.P. conceived the study and designed methodology. D.P. and M.P. were also responsible for project management and supervision of research activity. M.V., A.M. and A.P. provided significant resources. J.M.M.H., A.J., T.N. and A.P.M. were responsible for data curation, performing the research, formal analysis and software development. S.M.D. was involved in data visualization and presentation, and in writing the original manuscript draft. All the authors contributed to the review and editing of the final manuscript.

Funding

Life Whisperer Diagnostics, Pty Ltd is a wholly owned subsidiary of the parent company, Presagen Pty Ltd. Funding for the study was provided by Presagen, with grant funding received from the South Australian Government: Research, Commercialisation and Startup Fund (RCSF). 'In kind' support and embryology expertise to guide algorithm development were provided by Ovation Fertility.

Conflict of interest

J.M.M.H., D.P. and M.P. are co-owners of Life Whisperer Diagnostics, Pty Ltd, and of the parent company Presagen, Pty Ltd. Presagen has filed a provisional patent for the technology described in this manuscript (52985P pending). A.P.M. owns stock in Life Whisperer, and S.M.D., A.J., T.N. and A.P.M. are employees of Life Whisperer.

References

Annan JJ, Gudi A, Bhide P, Shah A, Homburg R. Biochemical pregnancy during assisted conception: a little bit pregnant. J Clin Med Res 2013;5:269–274.
Breiman L. Random forests. Machine Learning 2001;45:5–32.
Chen M, Wei S, Hu J, Yuan J, Liu F. Does time-lapse imaging have favorable results for embryo incubation and selection compared with conventional methods in clinical in vitro fertilization? A meta-analysis and systematic review of randomized controlled trials. PLoS One 2017;12:e0178720.
Gardner DK, Meseguer M, Rubio C, Treff NR. Diagnosis of human preimplantation embryo viability. Hum Reprod Update 2015;21:727–747.
784 VerMilyea et al.

Gardner DK, Sakkas D. Assessment of embryo viability: the ability to select a single embryo for transfer—a review. Placenta 2003;24:S5–S12.
GBD. Population and fertility by age and sex for 195 countries and territories, 1950–2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet 2018;392:1995–2051.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 27–30 June 2016, Piscataway NJ, USA: Institute of Electrical and Electronics Engineers, 2016;770–778.
Hearst MA. Support vector machines. IEEE Intell Syst 1998;13:18–28.
Huang G, Liu Z, Maaten LVD, Weinberger KQ. Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 21–26 July 2017, Piscataway NJ, USA: Institute of Electrical and Electronics Engineers, 2017;2261–2269.
Khosravi P, Kazemi E, Zhan Q, Malmsten JE, Toschi M, Zisimopoulos P, Sigaras A, Lavery S, Cooper LAD, Hickman C et al. Deep learning enables robust assessment and selection of human blastocysts after in vitro fertilization. NPJ Digit Med 2019;2:21.
Kingma D, Ba J. Adam: a method for stochastic optimization. Computing Research Repository (CoRR) 2014;abs/1412.6980.
Kragh MF, Rimestad J, Berntsen J, Karstoft H. Automatic grading of human blastocysts from time-lapse imaging. Comput Biol Med 2019;115:103494.
Rokach L. Ensemble-based classifiers. Artif Intell Rev 2010;33:1–39.
Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature 1986;323:533–536.
Sahlsten J, Jaskari J, Kivinen J, Turunen L, Jaanio E, Hietala K, Kaski K. Deep learning fundus image analysis for diabetic retinopathy and macular edema grading. Sci Rep 2019;9:10750.
Segal TR, Epstein DC, Lam L, Liu J, Goldfarb JM, Weinerman R. Development of a decision tool to predict blastocyst formation. Fertil Steril 2018;109:e49–e50.
Storr A, Venetis CA, Cooke S, Kilani S, Ledger W. Inter-observer and intra-observer agreement between embryologists during selection of a single Day 5 embryo for transfer: a multicenter study. Hum Reprod 2017;32:307–314.
Szegedy C, Ioffe S, Vanhoucke V. Inception-v4, Inception-ResNet and the impact of residual connections on learning. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 4–9 February 2017, Palo Alto CA, USA: AAAI Press, 2017;4278–4284.
Tran D, Cooke S, Illingworth PJ, Gardner DK. Deep learning as a predictive tool for fetal heart pregnancy following time-lapse incubation and blastocyst transfer. Hum Reprod 2019;34:1011–1018.
Wang J, Sauer MV. In vitro fertilization (IVF): a review of 3 decades of clinical innovation and technological advancement. Ther Clin Risk Manag 2006;2:355–364.
Wong CC, Loewke KE, Bossert NL, Behr B, De Jonge CJ, Baer TM, Reijo Pera RA. Non-invasive imaging of human embryos before embryonic genome activation predicts development to the blastocyst stage. Nat Biotechnol 2010;28:1115–1121.
