770–784, 2020. Advance Access Publication on April 2, 2020. doi:10.1093/humrep/deaa013
Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy
Submitted on October 13, 2019; resubmitted on December 23, 2019; editorial decision on January 16, 2020
STUDY QUESTION: Can an artificial intelligence (AI)-based model predict human embryo viability using images captured by optical light
microscopy?
SUMMARY ANSWER: We have combined computer vision image processing methods and deep learning techniques to create the non-
invasive Life Whisperer AI model for robust prediction of embryo viability, as measured by clinical pregnancy outcome, using single static images
of Day 5 blastocysts obtained from standard optical light microscope systems.
WHAT IS KNOWN ALREADY: Embryo selection following IVF is a critical factor in determining the success of ensuing pregnancy.
Traditional morphokinetic grading by trained embryologists can be subjective and variable, and other complementary techniques, such as
time-lapse imaging, require costly equipment and have not reliably demonstrated predictive ability for the endpoint of clinical pregnancy. AI
methods are being investigated as a promising means for improving embryo selection and predicting implantation and pregnancy outcomes.
STUDY DESIGN, SIZE, DURATION: These studies involved analysis of retrospectively collected data including standard optical light
microscope images and clinical outcomes of 8886 embryos from 11 different IVF clinics, across three different countries, between 2011 and
2018.
PARTICIPANTS/MATERIALS, SETTING, METHODS: The AI-based model was trained using static two-dimensional optical light
microscope images with known clinical pregnancy outcome as measured by fetal heartbeat to provide a confidence score for prediction of
pregnancy. Predictive accuracy was determined by evaluating sensitivity, specificity and overall weighted accuracy, and was visualized using
histograms of the distributions of predictions. Comparison to embryologists’ predictive accuracy was performed using a binary classification
approach and a 5-band ranking comparison.
MAIN RESULTS AND THE ROLE OF CHANCE: The Life Whisperer AI model showed a sensitivity of 70.1% for viable embryos while
maintaining a specificity of 60.5% for non-viable embryos across three independent blind test sets from different clinics. The weighted overall
accuracy in each blind test set was >63%, with a combined accuracy of 64.3% across both viable and non-viable embryos, demonstrating model
robustness and generalizability beyond the result expected from chance. Distributions of predictions showed clear separation of correctly and
incorrectly classified embryos. Binary comparison of viable/non-viable embryo classification demonstrated an improvement of 24.7% over
embryologists’ accuracy (P = 0.047, n = 2, Student’s t test), and 5-band ranking comparison demonstrated an improvement of 42.0% over
embryologists (P = 0.028, n = 2, Student’s t test).
† The authors consider that the first two authors should be regarded as joint first authors.
© The Author(s) 2020. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Artificial intelligence for embryo viability assessment 771
LIMITATIONS, REASONS FOR CAUTION: The AI model developed here is limited to analysis of Day 5 embryos; therefore, further
evaluation or modification of the model is needed to incorporate information from different time points. The endpoint described is clinical
pregnancy as measured by fetal heartbeat, and this does not indicate the probability of live birth. The current investigation was performed with
retrospectively collected data, and hence it will be of importance to collect data prospectively to assess real-world use of the AI model.
WIDER IMPLICATIONS OF THE FINDINGS: These studies demonstrated an improved predictive ability for evaluation of embryo
viability when compared with embryologists’ traditional morphokinetic grading methods. The superior accuracy of the Life Whisperer AI
model could lead to improved pregnancy success rates in IVF when used in a clinical setting. It could also potentially assist in standardization of
embryo selection methods across multiple clinical environments, while eliminating the need for complex time-lapse imaging equipment. Finally,
the cloud-based software application used to apply the Life Whisperer AI model in clinical practice makes it broadly applicable and globally
scalable to IVF clinics worldwide.
STUDY FUNDING/COMPETING INTEREST(S): Life Whisperer Diagnostics, Pty Ltd is a wholly owned subsidiary of the parent company, Presagen Pty Ltd. Funding for the study was provided by Presagen, with grant funding received from the South Australian Government.
Key words: assisted reproduction / embryo quality / IVF/ICSI outcome / artificial intelligence / machine learning
Introduction

With global fertility generally declining (GBD, 2018), many couples and individuals are turning to assisted reproduction procedures for help with conception. Unfortunately, success rates for IVF are quite low, at ∼20–30% (Wang and Sauer, 2006), placing significant emotional and financial strain on those seeking to achieve a pregnancy. During the IVF process, one of the critical determinants of a successful pregnancy is embryo quality, and the embryo selection process is essential for ensuring the shortest time to pregnancy for the patient. There is a pressing motivation to improve the way in which embryos are selected for transfer into the uterus during the IVF process.

Currently, embryo selection is a manual process involving assessment of embryos by trained clinical embryologists, through visual inspection of morphological features using an optical light microscope. The most common scoring system used by embryologists is the Gardner Scale (Gardner and Sakkas, 2003), in which morphological features such as inner cell mass (ICM) quality, trophectoderm quality and embryo developmental advancement are evaluated and graded according to an alphanumeric scale. One of the key challenges in embryo grading is the high level of subjectivity and intra- and inter-operator variability that exists between embryologists of different skill levels (Storr et al., 2017). This means that standardization is difficult even within a single laboratory, and impossible across the industry as a whole. Other complementary techniques are available for assisting with embryo selection, such as time-lapse imaging, which continuously monitors the growth of embryos in culture with simple algorithms that assess critical growth milestones. Although this approach is useful in determining whether an embryo at an early stage will develop through to a mature blastocyst, it has not been demonstrated to reliably predict pregnancy outcomes, and it is therefore limited in its utility for embryo selection at the Day 5 time point (Chen et al., 2017). Additionally, the requirement for specialized time-lapse imaging hardware makes this approach cost prohibitive for many laboratories and clinics, and limits widespread use of the technique.

The objective of the current clinical investigation was to develop and test a non-invasive artificial intelligence (AI)-based assessment approach to aid in embryo selection during IVF, using single static two-dimensional images captured by optical light microscopy methods. The aim was to combine computer vision image processing methods and deep learning to create a robust model for analysis of Day 5 embryos (blastocysts) for prediction of clinical pregnancy outcomes. This is the first report of an AI-based embryo selection method that can be used for analysis of images taken using standard optical light microscope systems, without requiring time-lapse imaging equipment for operation, and which is predictive of pregnancy outcome. Using an AI screening method to improve selection of embryos prior to transfer has the potential to improve IVF success rates in a clinical setting.

A common challenge in evaluating AI and machine learning methods in the medical industry is that each clinical domain is unique, and requires a specialized approach to address the issue at hand. There is a tendency for industry to compare the accuracy of AI in one clinical domain to another, or to compare the accuracy of different AI approaches within a domain that assess different endpoints. These are not valid comparisons, as they consider neither the clinical context nor the relevance of the ground-truth endpoint used for the assessment of the AI. Caution needs to be taken to understand the context in which the AI is operating and the benefit it provides in complement with current clinical processes. One example presented by Sahlsten et al. (2019) described an AI model that assessed fundus images for diabetic retinopathy with an accuracy of over 90%. In this domain, the clinician baseline accuracy is ∼90%, and therefore an AI accuracy of >90% is reasonable and necessary to justify clinical relevance. Similarly, in the field of embryology, AI models developed by Khosravi et al. (2019) and Kragh et al. (2019) showed high levels of accuracy in classification of blastocyst images according to the well-established Gardner scale. This approach is expected to yield a high accuracy, as the model simply mimics the Gardner scale to predict a known outcome. While this method may be useful for standardization of embryo classification according to the Gardner scale, it is not in fact predictive of pregnancy success, as it is based on a different endpoint. Of relevance to the current study, an AI model developed by Tran et al. (2019) was in fact intended to classify embryo quality based on clinical
772 VerMilyea et al.
Table I. Results of pilot study demonstrate feasibility of creating an artificial intelligence-based image analysis model for prediction of human embryo viability.

                                         Validation   Blind Test   Blind Test   Total blind
                                         dataset      Set 1        Set 2        test dataset
Composition of dataset
  Total image no.                        390          368          632          1000
  No. of positive clinical pregnancies   70           76           194          270
  No. of negative clinical pregnancies   320          292          438          730
AI model accuracy
  Accuracy viable embryos (sensitivity)  74.3%        63.2%        78.4%        74.1%

a Images where the AI model was correct and the embryologist was incorrect, and vice versa.
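As a consistency check on Table I, the combined blind-set sensitivity can be recomputed from the per-set sensitivities and the positive-pregnancy counts; this is a quick sketch, and the variable names are ours rather than the study's:

```python
# Recompute Table I's combined blind-set sensitivity from the per-set
# sensitivities and positive clinical pregnancy counts given above.
sets = [
    (76, 0.632),   # Blind Test Set 1: positives, sensitivity
    (194, 0.784),  # Blind Test Set 2: positives, sensitivity
]
true_positives = sum(n * s for n, s in sets)
combined = true_positives / sum(n for n, _ in sets)
print(f"{combined:.1%}")  # → 74.1%
```

The weighted combination reproduces the 74.1% reported for the total blind test dataset, confirming the table is internally consistent.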
pregnancy outcome. This study did not report percentage accuracy of prediction, but instead reported a high level of accuracy for their model, IVY, using a receiver operating characteristic (ROC) curve. The AUC for IVY was 0.93 for true positive rate versus false positive rate; negative predictions were not evaluated. However, the datasets used for training and evaluation of this model were only partly based on actual ground-truth clinical pregnancy outcome: a large proportion of predicted non-viable embryos were never actually transferred, and were only assumed to lead to an unsuccessful pregnancy outcome. Thus, the reported performance is not entirely relevant in the context of clinical applicability, as the actual predictive power for the presence of a fetal heartbeat has not been evaluated to date.

The AI approach presented here is the first study of its kind to evaluate the true ability of AI for predicting pregnancy outcome, by exclusively using ground-truth pregnancy outcome data for AI development and testing. It is important to note that while a pregnancy outcome endpoint is more clinically relevant and informative, it is inherently more complex in nature due to patient and laboratory variability that impact pregnancy success rates beyond the quality of the embryo itself. The theoretical maximum accuracy for prediction of this endpoint based on evaluation of embryo quality is estimated to be ∼80%, with the remaining 20% affected by patient-related clinical factors, such as endometriosis, or laboratory process errors in embryo handling, that could lead to a negative pregnancy outcome despite a morphologically favorable embryo appearance (Annan et al., 2013). Given the influence of confounding variables, and the low average accuracy presently demonstrated by embryologists using traditional morphological grading methods (∼50% in the current study, Tables I and II), an AI model with an improvement of even 10–20% over that of embryologists would be considered highly relevant in this clinical domain. In the current study, we aimed to develop an AI model that demonstrated superiority to embryologists' predictive power for embryo viability, as determined by ground-truth clinical pregnancy outcome.

Materials and Methods

Experimental design

These studies were designed to analyze retrospectively collected data for development and testing of the AI-based model in prediction of embryo viability. Data were collected for female subjects who had undergone oocyte retrieval, IVF and embryo transfer. The investigation was non-interventional, and results were not used to influence treatment decisions in any way. Data were obtained for consecutive patients who had undergone IVF at 11 independent clinics in three countries (the USA, Australia and New Zealand) from 2011 to 2018. Data were limited to patients who received a single embryo transfer with a Day 5 embryo, and where the endpoint was clinical pregnancy outcome as measured by fetal heartbeat at first scan. The clinical pregnancy endpoint was deemed to be the most appropriate measure of embryo viability, as this limited any confounding patient-related factors post-implantation. Criteria for inclusion/exclusion were established prospectively, and images not matching the criteria were excluded from analysis.

For inclusion in the study, images were required to be of embryos on Day 5 of culture taken using a standard optical light microscope-mounted camera. Images were only accepted if they were taken prior to PGS biopsy or freezing. All images were required to have a minimum
resolution of 512 x 512 pixels with the complete embryo in the field of view. Additionally, all images were required to have matched clinical pregnancy outcome available (as detected by the presence of a fetal heartbeat on first ultrasound scan). For a subset of patients, the embryologist's morphokinetic grade was available, and was used to compare the accuracy of the AI with the standard visual grading method for those patients.

Table II. Results of the pivotal study demonstrate generalizability of the Life Whisperer AI model for prediction of human embryo viability across multiple clinical environments.
a Embryologist scores were not available. b Images where the AI model was correct and the embryologist was incorrect, and vice versa. ND = not done; No. = number.

Ethics and compliance

All patient data used in these studies were retrospective and provided in a de-identified format. In the USA, the studies described were deemed exempt from Institutional Review Board (IRB) review pursuant to the terms of the United States Department of Health and Human Services' Policy for Protection of Human Research Subjects at 45 C.F.R. § 46.101(b) (IRB ID #6467, Sterling IRB). In Australia, the studies described were deemed exempt from Human Research Ethics Committee review pursuant to Section 5.1.2.2 of the National Statement on Ethical Conduct in Human Research 2007 (updated 2018), in accordance with the National Health and Medical Research Council Act 1992 (Monash IVF). In New Zealand, the studies described were deemed exempt from Health and Disability Ethics Committee review pursuant to Section 3 of the Standard Operating Procedures for Health and Disability Ethics Committees, Version 2.0 (August 2014).

These studies were not registered as a clinical trial as they did not meet the definition of an applicable clinical trial as defined by the ICMJE, that is, 'a clinical trial is any research project that prospectively assigns people or a group of people to an intervention, with or without concurrent comparison or control groups, to study the relationship between a health-related intervention and a health outcome'.

Viability scoring methods

For the AI model, an embryo viability score of 50% and above was considered viable, and below 50% non-viable. Embryologists' scores were provided for Day 5 blastocysts at the time when the image was taken. These scores were placed into scoring bands, which were roughly divided into 'likely viable' and 'likely non-viable' groups. This generalization allowed comparison of binary predictions from the embryologists with predictions from the AI model (viable/non-viable). The scoring system used by embryologists was based on the Gardner scale of morphokinetic grading (Gardner and Sakkas, 2003) for the quality of the ICM and the trophectoderm of the embryo, indicated by a single letter (A–E). Also included was either a numerical score or a description of the embryo's stage of development toward hatching. Numbers were assigned in ascending order of embryo development as follows: 1 = start of cavitation, 2 = early blastocyst, 3 = full blastocyst, 4 = expanded blastocyst and 5 = hatching blastocyst. If no developmental stage was given, it was assumed that the embryo was at least an early blastocyst (>2). The conversion table for all clinics providing embryologists' scores is provided in Supplementary Table SI. For embryologists' scores, embryos of 3BB or higher grading were considered viable, and those below 3BB were considered non-viable.

Comparisons of embryo viability ranking were made by equating the embryologist's assessment with a numerical score from 1 to 5.
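The two binary decision rules above can be sketched in a few lines. This is a minimal illustration, assuming Gardner grades arrive as strings like '4AB' (expansion number, ICM letter, trophectoderm letter); the parsing and function names are our own, not the study's code:

```python
# Sketch of the binary viability calls described above. Grade-string parsing
# is an illustrative assumption, not the authors' implementation.

def ai_call(confidence: float) -> bool:
    """AI rule: a normalized viability score of 0.5 (50%) or above is viable."""
    return confidence >= 0.5

def embryologist_call(grade: str) -> bool:
    """Embryologist rule: 3BB or higher is viable, i.e. expansion >= 3 with
    ICM and trophectoderm letters of B or better ('A' sorts before 'B')."""
    expansion, icm, te = int(grade[0]), grade[1], grade[2]
    return expansion >= 3 and icm <= "B" and te <= "B"

print(ai_call(0.73), embryologist_call("4AB"))  # True True
print(ai_call(0.41), embryologist_call("3BC"))  # False False
```

Casting both scoring systems to the same viable/non-viable binary is what makes the head-to-head comparison in the Results possible.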
where the diagonal whitespace in the square image due to these rotations was filled in with background color using the OpenCV (version 3.2.0; Willow Garage, Itseez, Inc. and Intel Corporation; 2200 Mission College Blvd, Santa Clara, CA 95052, USA) 'Border_Replicate' method, which uses the pixel values near the image border to fill in whitespace after rotation.
– Reflections: horizontal or vertical reflections of the image were also included as training augmentations.
– Gaussian blur: Gaussian blurring was applied to some images using a fixed kernel size (with a default value of 15).
– Contrast variation: contrast variation was introduced to images by modifying the standard deviation of the pixel variation of the image from the mean, away from its default value.
– This process was repeated thousands of times until the loss was reduced as much as possible and the value plateaued.
– When all batches in the training dataset had been assessed (i.e. 1 epoch), so that the entire training set had been covered, the training set was re-randomized, and training was repeated.
– After each epoch, the model was run on a fixed subset of images reserved for informing the training process to prevent over-training (the validation set).
– The train-validate cycle was carried out for 2–100 epochs until a sufficiently stable model was developed with a low loss function. At the conclusion of the series of train-validate cycles, the highest performing models were combined into a final ensemble model as described below.

In all cases, use of ImageNet pretrained weights demonstrated improved performance of these quantities.

Loss functions that were evaluated as options for the model's hyperparameters included cross entropy (CE), weighted CE and residual CE loss functions. The accuracy on the validation set was used as the selection criterion to determine a loss function. Following evaluation, only weighted CE and residual CE loss functions were chosen for use in the final model, as these demonstrated improved performance. For more information see the Model selection process section.

Deep learning optimization specifications

… model across all datasets and was therefore chosen as the preferred model. The final ensemble model includes eight deep learning models, of which four are zona models and four are full-embryo models. The final model configuration used in this study is as follows:
– One full-embryo ResNet-152 model, trained using SGD with momentum = 0.9, CE loss, learning rate 5.0e-5, step-wise scheduler halving the learning rate every 3 epochs, batch size of 32, input resolution of 224 x 224 and a dropout value of 0.1.
– One zona ResNet-152 model, trained using SGD with momentum = 0.99, CE loss, learning rate 1.0e-5, step-wise scheduler dividing the learning rate by 10 every 3 epochs, batch size of …
… proportion of the original embryologist accuracy (i.e. (AI_accuracy - embryologist_accuracy)/embryologist_accuracy).

For these analyses, embryologist scores corresponding to blastocyst assessment at Day 5 were provided; that is, their assessment was provided at the same point in time as when the image was taken. This ensured that the time point for model assessment and the embryologist's assessment were consistent. Note that the subset of data that includes corresponding embryologist scores was sourced from a range of clinics, and thus the measured embryologist grading accuracy varied across each clinic, from 43.9% to 55.3%. This is due to the variation in embryologist skill, and statistical fluctuation of embryologist scoring methods across the dataset. In order to provide a comparison that ensured the most representative distribution of embryologist skill levels, all embryologist scores were considered across all clinics, and combined in an unweighted manner, instead of considering accuracies from individual clinics. This approach therefore captured the inherent diversity in embryologist scoring efficacy.

The distributions of prediction scores for both viable and non-viable embryo images were used to determine the ability of the AI model to separate true positives from false negatives, and true negatives from false positives. AI model predictions were normalized between 0 and 1, and interpreted as confidence scores. Distributions were presented as histograms based on the frequency of confidence scores. Bi-modal distributions of predictions indicated that true positives and false negatives, or true negatives and false positives, were separated with a degree of confidence, meaning that the predictive power of the model on a given dataset was less likely to have been obtained by chance. Alternatively, slight asymmetry in a unimodal Gaussian-like distribution falling on either side of a threshold indicated that the model was not easily able to separate distinct classes of embryo.

A binary confusion matrix containing class accuracy measures, i.e. sensitivity and specificity, was also used in model assessment. The confusion matrix evaluated model classification and misclassification based on true positives, false negatives, false positives and true negatives. These numbers were depicted visually using tables or ROC plots where applicable.

Final model accuracy was determined using results from blind datasets only, as these consisted of completely independent 'unseen' datasets that were not used in model training or validation. In general, the accuracy of any validation dataset will be higher than that of a blind dataset, as the validation dataset is used to guide training and selection of the AI model. For a true, unbiased measure of accuracy, only blind datasets were used. The number of replicates used for determination of accuracy was defined as the number of completely independent blind test sets comprising images that were not used in training the AI model. Double-blind test sets, consisting of images provided by clinics that did not provide any data for model training, were used to evaluate whether the model had been over-trained on data provided by the original clinics.

Results

Datasets used in model development

Model development was divided into two distinct studies. The first study consisted of a single-site pilot study to determine the feasibility of creating an AI model for prediction of embryo viability, and to refine the training techniques and principles to be adopted for a second multi-site study. The first study, or pilot study, was performed using a total of 5282 images provided by a single clinic in Australia, with 3892 images used for the training process. The AI model techniques explored in
the pilot study were then further developed in a second, pivotal study to determine generalizability to different clinical environments. The pivotal study used a total of 3604 images provided by 11 clinics from across the USA, Australia and New Zealand, with a total of 1744 images of Day 5 embryos used for training the Life Whisperer AI model presented in this article.

Images were split into defined datasets for model development in each study, which included a training dataset, a validation dataset and multiple blind test sets. Figure 4 depicts the number and origin of images that were used in each dataset in both the pilot and pivotal studies. A significant proportion of images in each study were used in model training, with a total of 3892 images used in the pilot study, and a further 1744 images used in the pivotal study. AI models were selected and validated using validation datasets, which contained 390 images in the pilot study and 193 in the pivotal study. Accuracy was determined using blind datasets only, comprising a total of 1000 images in the pilot study, and 1667 images in the pivotal study. Two independent blind test sets were evaluated in the pilot study; these were both provided by the same clinic that provided images for training. Three independent blind test sets were evaluated in the pivotal study. Blind Test Set 1 comprised images from the same clinics that provided images for training. Blind Test Sets 2 and 3 were, however, provided by completely independent clinics that did not provide any data for training. Thus, Blind Test Sets 2 and 3 represented double-blinded datasets as relates to AI computational methods.

In total, 52.5% of all images in the blind datasets had embryologist grades available for comparison of outcome. Note that embryologists' grades were not available for Blind Test Set 2 in the pivotal study; therefore, n = 2 for both studies in comparison of AI model accuracy to that of embryologists.

Pilot feasibility study

Table I shows a summary of results for the pilot study presented according to dataset (validation dataset, individual blind test sets and combined blind test dataset). In this study, negative pregnancies were found to outweigh positive pregnancies by approximately 3-fold. Sensitivity of the Life Whisperer AI model for viable embryos was 74.1%, and specificity for non-viable embryos was 65.3%. The greater sensitivity compared to specificity was to be expected, as it reflects the intended bias to grade embryos as viable that was introduced during model development. Overall accuracy for the AI model was 67.7%. For the subset of images that had embryologists' scores available, the AI model provided an average accuracy improvement of 30.8% over embryologists' grading for viable/non-viable predictions (P = 0.021, n = 2, Student's t test). The AI model correctly predicted viability over the embryologist 148 times, whereas the embryologist correctly predicted viability over the model 54 times, representing a 2.7-fold improvement for the AI model.

The AI model developed in this pilot study was used as a basis for further development in the pivotal study described below.

Model accuracy and generalizability

The results of the pivotal study are presented in Table II. In this study, the distribution of negative and positive pregnancies was more even than in the pilot study, with negative pregnancies occurring ∼50% more often than positive pregnancies (a 1.5-fold increase, compared to a 3-fold increase in the pilot study).

After further development using data from a range of clinics, the Life Whisperer AI model showed a sensitivity of 70.1% for viable embryos, and a specificity of 60.5% for non-viable embryos. This was relatively similar to the initial accuracy values obtained in the pilot study, although values were marginally lower; this was not unexpected due to the introduction of inter-clinic variation into the AI development. Note that while the sensitivity in Blind Test Set 3 was ∼5–7% lower than that of Blind Test Sets 1 and 2, the specificity was ∼8% higher than in those datasets, making the overall accuracy comparable across all three blind test sets. The overall accuracy in each blind test set was >63%, with a combined overall accuracy of 64.3%.

Binary comparison of viable/non-viable embryo classification demonstrated that the AI model provided an average accuracy improvement of 24.7% over embryologists' grading (P = 0.047, n = 2, Student's t test). The AI model correctly predicted viability over the embryologist 172 times, whereas the embryologist correctly predicted viability over the model 78 times, representing a 2.2-fold improvement for the AI model. Comparison to embryologists' scores using the 5-band ranking system approach showed that the AI model was correct over embryologists for 40.6% of images, and incorrect compared to embryologists' scoring for 28.6% of images, representing an improvement of 42.0% (P = 0.028, n = 2, Student's t test).

Confusion matrices showing the total number of true positives, false positives, false negatives and true negatives obtained from embryologist grading methods and the AI model are shown in Figure 5.
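The assessment metrics used throughout these Results (class accuracies from the binary confusion matrix, and relative improvement over embryologist accuracy) can be sketched and checked against the pilot figures above. The counts in the first call are illustrative, and the variable names are ours:

```python
# Class accuracies from a binary confusion matrix, as described in the
# Materials and Methods. The tp/fn/fp/tn counts here are illustrative only.
def class_accuracies(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)   # accuracy on truly viable embryos
    specificity = tn / (tn + fp)   # accuracy on truly non-viable embryos
    return sensitivity, specificity

# Relative improvement over embryologist accuracy, i.e.
# (AI_accuracy - embryologist_accuracy) / embryologist_accuracy.
def relative_improvement(ai_acc, emb_acc):
    return (ai_acc - emb_acc) / emb_acc

# Consistency check: recover the pilot study's 67.7% overall accuracy from
# the reported sensitivity/specificity and the Table I outcome counts.
positives, negatives = 270, 730
sensitivity, specificity = 0.741, 0.653
overall = (sensitivity * positives + specificity * negatives) / (positives + negatives)
print(f"{overall:.1%}")  # → 67.7%
```

Weighting the class accuracies by the outcome counts reproduces the reported overall accuracy, which is why the text describes it as a weighted accuracy across viable and non-viable embryos.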
780 VerMilyea et al.
As a final measure of AI model performance, the distributions of prediction scores for viable and non-viable embryos were graphed to evaluate the ability of the model to separate correctly from incorrectly identified embryo images. The histograms for distributions of prediction scores are presented in Figure 7. The shapes of the histograms demonstrate clear separation between correctly and incorrectly identified viable or non-viable embryo images.
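The separation check described above amounts to histogramming the model’s prediction scores separately for correctly and incorrectly classified embryos. A minimal sketch of that bookkeeping, using synthetic scores and a hypothetical 0.5 decision threshold (both are assumptions for illustration, not the study’s values):

```python
# Split prediction scores into correctly vs incorrectly classified
# groups and bin each group into a coarse histogram, as one would do
# to inspect separation of the two distributions. The scores and the
# 0.5 threshold are illustrative assumptions, not the study's values.
from collections import Counter

def score_histograms(scores, labels, threshold=0.5, bins=5):
    """labels: 1 = viable, 0 = non-viable. Returns (correct, incorrect)
    Counters keyed by histogram bin index over [0, 1)."""
    correct, incorrect = Counter(), Counter()
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        bin_index = min(int(score * bins), bins - 1)
        (correct if predicted == label else incorrect)[bin_index] += 1
    return correct, incorrect

scores = [0.92, 0.81, 0.35, 0.60, 0.15, 0.55, 0.48, 0.70]
labels = [1,    1,    0,    0,    0,    1,    1,    1   ]
good, bad = score_histograms(scores, labels)
print("correct bins:", dict(good), "incorrect bins:", dict(bad))
```

When the model separates the classes well, the correctly classified scores cluster away from the threshold and the incorrectly classified scores sit near it, which is the pattern the histogram shapes are meant to reveal.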
Discussion
In these studies, Life Whisperer used ensemble modeling to combine computer vision methods and deep learning neural network techniques

pregnancy. These studies achieved 76.4% (Segal et al., 2018) and >93% (Wong et al., 2010) accuracy, respectively, in their overall classification objectives, which included prediction of blastocyst formation based on Day 2 and/or Day 3 morphology and a number of other independent data points. However, it is important to note that blastocyst formation is not a reliable indicator of the probability of clinical pregnancy, and therefore the utility of this approach for prediction of pregnancy outcome is limited.

As discussed earlier, three recent studies described development of AI-based systems for classification of embryo quality (Khosravi et al., 2019; Kragh et al., 2019; Tran et al., 2019). All three studies utilized images taken by time-lapse imaging systems, which likely standardized the quality of images provided for analysis compared to those obtained by standard optical light microscopy. Khosravi et al. (2019) reported an accuracy of 97.5% for their model STORK in predicting embryo grade. Their model was not, however, developed to predict clinical pregnancy outcome. The high reported accuracy in this case may be attributed to the fact that the analysis was limited to classification of poor versus good quality embryos—fair quality embryos in between were excluded from analysis. Similarly, Kragh et al. (2019) reported accuracies of 71.9% and 76.4% for their model in grading embryonic ICM and trophectoderm, respectively, according to standard morphological grading methods. This was shown to be at least as good as the performance of trained embryologists. The authors also evaluated predictive accuracy for implantation, for which data were available for a small cohort of images. The AUC for prediction of implantation was not significantly different to that of embryologists (AUC of 0.66 and 0.64, respectively), and therefore this model has limited ability for prediction of pregnancy outcome.

The approach taken by Tran et al. (2019) for development of their AI model IVY used deep learning to analyze time-lapse embryo images to predict pregnancy success rates. This study used 10 683 embryo images from 1648 individual patients throughout the course of the training and development of IVY, with 8836 embryos coded as positive or negative cases. Of note, although developed to predict pregnancy outcome, the IVY AI was trained on a heavily biased dataset of only 694 cases (8%) of positive pregnancy outcomes, with 8142 negative outcome cases (92%). Additionally, 87% (7063 cases) of the negative outcome cases were from embryos that were never transferred to a patient, discarded based on abnormal morphology considerations or aneuploidy, and therefore the ground-truth clinical pregnancy outcome cannot be known. The approach used to train the IVY AI only used ground-truth pregnancy outcome for a very small proportion of the algorithm training and thus has a heavy inherent bias toward the embryologist assessment for negative outcome cases. Although somewhat predictive of pregnancy outcome, the accuracy of the AI has not truly been measured on ground-truth outcomes of clinical pregnancy, and gives a false representation of the true predictive accuracy of the AI, which can only be truly assessed on an AI model that has been trained exclusively on known fetal heartbeat outcome data.

The works discussed above are not experimentally comparable with the current study, as they generally relate to different endpoints; for example, the prediction of blastocyst formation at Day 5 starting from an image at Day 2 or Day 3 post-IVF. While there is some benefit in these methods, they do not provide any power in predicting clinical pregnancy, in contrast to the present study evaluating the Life Whisperer model. Other studies have shown that a high level of accuracy can be achieved through the use of AI in replicating embryologist scoring methods (Khosravi et al., 2019; Kragh et al., 2019); however, the work presented here has shown that the accuracy of embryologist grading methods in predicting clinical pregnancy rates is in actuality fairly low. An AI model trained to replicate traditional grading methods to a high degree of accuracy may be useful for automation and standardization of grading, but it can, at best, only be as accurate as the grading method itself. In the current study, only ground-truth outcomes for fetal heartbeat at first scan were used in the training, validation and testing of the model. Given the nature of predicting implantation based on embryo morphology, which will necessarily be confounded by patient factors beyond the scope of morphological assessment, it would be expected
that the overall accuracy of the Life Whisperer AI model would be lower than alternative endpoints but more clinically relevant. For the first time, this study presents a realistic measurement of AI accuracy for embryo assessment and a true representation of predictive ability for the pregnancy outcome endpoint. Given the relatively low accuracy for embryologists in predicting viability, as shown in this study (∼50%), and a theoretical maximum accuracy of 80%, Life Whisperer’s AI model accuracy of ∼65% represents a significant and clinically relevant improvement for predicting embryo viability in this domain.

The present study demonstrated that the Life Whisperer AI model provided suitably high sensitivity, specificity and overall accuracy levels
for prediction of embryo viability based directly on ground-truth clinical pregnancy outcome by indication of positive fetal cardiac activity on ultrasound. The model was able to predict embryo viability by analysis of images obtained using standard optical light microscope systems, which are utilized by the majority of IVF laboratories and clinics worldwide. AUC/ROC was not used as a primary methodology for evaluation of accuracy due to inherent limitations of the approach when applied to largely unbalanced datasets, such as those used in development of IVY (which used a dataset with a ∼13:1 ratio of negative to positive clinical pregnancies) (Tran et al., 2019). Nevertheless, the ROC curve for the Life Whisperer AI model is presented for completeness in Supplementary Figure SI, with results demonstrating an improved AUC for the AI model when compared to embryologist’s scores.

Acknowledgements

The authors acknowledge the kind support of investigators and collaborating clinics for providing embryo images and associated data as follows: Hamish Hamilton and Michelle Lane, Monash IVF Group/Repromed (Adelaide SA, Australia); Matthew ‘Tex’ VerMilyea and Andrew Miller, Ovation Fertility (Austin TX and San Antonio TX, USA); Bradford Bopp, Midwest Fertility Specialists (Carmel IN, USA); Erica Behnke, Institute for Reproductive Health (Cincinnati OH, USA); Dean Morbeck, Fertility Associates (Auckland, Christchurch, Dunedin, Hamilton and Wellington, New Zealand); and Rebecca Matthews, Oregon Reproductive Medicine (Portland OR, USA).
Gardner DK, Sakkas D. Assessment of embryo viability: the ability to select a single embryo for transfer—a review. Placenta 2003;24:S5–S12.

GBD. Population and fertility by age and sex for 195 countries and territories, 1950–2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet 2018;392:1995–2051.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 27–30 June, Piscataway NJ, US: Institute of Electrical and Electronics Engineers, 2016;770–778.

Hearst MA. Support vector machines. IEEE Intell Syst 1998;13:18–28.

Huang G, Liu Z, Maaten LVD, Weinberger KQ. Densely connected convolutional networks. IEEE Conference on Computer Vision and

Sahlsten J, Jaskari J, Kivinen J, Turunen L, Jaanio E, Hietala K, Kaski K. Deep learning fundus image analysis for diabetic retinopathy and macular edema grading. Sci Rep 2019;9:10750.

Segal TR, Epstein DC, Lam L, Liu J, Goldfarb JM, Weinerman R. Development of a decision tool to predict blastocyst formation. Fertil Steril 2018;109:e49–e50.

Storr A, Venetis CA, Cooke S, Kilani S, Ledger W. Inter-observer and intra-observer agreement between embryologists during selection of a single Day 5 embryo for transfer: a multicenter study. Hum Reprod 2017;32:307–314.

Szegedy C, Ioffe S, Vanhoucke V. Inception-v4, inception-ResNet and the impact of residual connections on learning. Proceedings