Error Classification
Chaoqiong Ma* Ruoxi Wang* Shun Zhou, Meijiao Wang and Haizhen Yue
Key laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Radiation Oncology,
Peking University Cancer Hospital & Institute, Beijing 100142, China
Yibao Zhang and Hao Wu a)
Key laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Radiation Oncology,
Peking University Cancer Hospital & Institute, Beijing 100142, China
Institute of Medical Technology, Peking University Health Science Center, Beijing 100191, China
(Received 30 June 2020; revised 3 September 2020; accepted for publication 15 October 2020;
published 27 November 2020)
Purpose: The use of radiomics and machine learning (ML) techniques to analyze
two-dimensional gamma maps has been shown to be superior to conventional gamma analysis
for error identification in intensity modulated radiotherapy (IMRT) quality assurance (QA). Recently,
the Structural SIMilarity (SSIM) sub-index maps were shown to be able to reveal the error types of
the dose distributions. In this study, we aimed to apply radiomics analysis on SSIM sub-index maps
and develop ML models to classify delivery errors in patient-specific dynamic IMRT QA.
Methods: Twenty-one sliding-window IMRT plans of 180 beams for three treatment sites were
involved in this study. Four types of machine-related errors of various magnitudes were simulated for
each beam at each control point, including the monitor unit (MU) variations, same-directional and
opposite-directional shifts of the multileaf collimators (MLCs) and random mispositioning of the
MLCs. In the QA process, a total of 1620 portal dose (PD) images were acquired for the beams with
and without errors. The predicted PD images of the original beams were set as references. To quan-
tify the agreement between a measured PD image and the corresponding predicted PD image, four
difference maps including three SSIM sub-index maps, and one dose difference-derived map were
calculated. Then, radiomic features were extracted from the four difference maps of each measured
PD image. We tested four typical classifiers, including a linear discriminant classifier (LDC), two support
vector machine (SVM) classifiers, and a random forest (RF), for this multiclass classification
task. A nested cross-validation scheme was used for model evaluations, where the SVM recursive
feature elimination method was applied for feature selection. Finally, the performance of the ML
model on identifying the error-free and the erroneous cases was compared to that of the conventional
gamma analysis.
Results: The statistics of the selected features showed that all of the difference maps and the feature
categories made balanced contributions to solve this classification task. Best performance was
achieved by the Linear-SVM model with average overall classification accuracy of 0.86. Specifically,
the average classification accuracies of the shift, opening, and the random errors were around 0.9.
Moreover, ~80% of error-free and MU errors were correctly classified. Using gamma analysis, the
3 mm/3% criterion was found to be insensitive to errors (sensitivity was only 0.33). Although the sensitivity
to errors with the 2 mm/2% criterion increased to 0.79, it was still 8% lower than that of the ML model.
Conclusions: We proposed an ML-based method for machine-related error identification in patient-
specific dynamic IMRT QA, where radiomic analysis of SSIM sub-index maps was used for feature
extraction. With extensive validation to select the best features and classifiers, high accuracies in
error classification were achieved. Compared with the conventional gamma threshold method, this
approach has great potential in error identification for the patient-specific IMRT QA process. ©
2020 American Association of Physicists in Medicine [https://doi.org/10.1002/mp.14559]
Key words: IMRT QA, machine learning, quality assurance, radiomics, SSIM analysis
80 Med Phys 48 (1), January 2021 0094-2405/2021/48(1)/80/14 © 2020 American Association of Physicists in Medicine 80
81 Ma et al.: SSIM analysis of errors in IMRT QA 81
measurements with film,3 detector array,4 or electronic portal imaging device (EPID).5
Gamma analysis has been widely accepted as a quantitative tool to evaluate the agreement between planned and measured dose distributions. In clinical applications, the gamma pass rate is calculated under a criterion that combines thresholds in dose difference (DD) and distance-to-agreement (DTA) to assess the clinical acceptability of IMRT plans.6 However, it has been reported that the gamma analysis is insensitive to dose errors and that its results do not correlate with the clinical dose errors.7–9 Additionally, it is difficult to identify the root cause of an existing discrepancy between dose distributions, owing to the fact that the spatial information is discarded in the gamma pass rate.
To overcome the limitations of conventional gamma analysis, attempts have been made to establish the relationship between the observed spatial dose discrepancy and the underlying error types in IMRT QA. Radiomics analysis and convolutional neural networks (CNNs) were employed to extract features from the gamma maps between calculated planar dose maps and planar dose maps reconstructed from EPID measurements.10,11 Based on the radiomic features extracted from gamma maps, machine learning (ML) models were developed to identify MLC leaf errors. The applications of radiomics and CNN to the analysis of gamma maps in IMRT QA have been demonstrated to provide information complementary to traditional gamma analysis. In addition to gamma maps, DD maps were created from volumetric modulated arc therapy (VMAT) QA and analyzed using CNN models.12 Higher accuracy was found in the classification of MLC positional errors using the DD maps than using the gamma maps in VMAT QA.
Recently, the Structural SIMilarity (SSIM) index, an image-quality metric widely used as an objective perceptual measure in the image processing field, was introduced into the radiotherapy field.13 The SSIM index was developed to measure the similarity between two images, modeling any discrepancy as a combination of loss in correlation, luminance difference, and contrast difference. The capability of the SSIM analysis to reveal different types of errors between two dose distributions has been demonstrated by Peng et al.14 The sub-indices of SSIM (luminance, contrast, and structure) were shown to be capable of detecting absolute dose error, gradient discrepancy, and dose structure error, respectively. Most importantly, the SSIM sub-index maps could not only indicate the location of large discrepancies, but also demonstrate different types of error-related patterns, which makes the SSIM analysis a potential tool for error identification in IMRT QA. The only study applying the SSIM index to intra-fraction patient verification showed higher sensitivity to patient positional and anatomical variations than conventional gamma analysis.15 However, no feasibility study has yet applied the SSIM sub-index maps to IMRT QA for error identification.
In this study, we aimed to develop a method for error detection and classification in patient-specific dynamic IMRT QA using image features extracted from the SSIM sub-index maps. Different from the gamma maps, the SSIM not only reflects perceptual image differences, but also possesses several independent components describing different local patterns of two images. We assert that including multidimensional difference measures from the SSIM maps would lead to substantial improvement in error classification based on the planar QA dose images. To verify this assumption, several classification models were trained based on the SSIM sub-index maps of EPID images with or without simulated errors. The performance of the models was assessed and compared with conventional gamma analysis in terms of error detection rate. To the authors' knowledge, this is the first work in patient-specific IMRT QA error classification combining the SSIM measures with the radiomics analysis.
2. MATERIALS AND METHODS
The main hypothesis of this study was that the machine-related errors in treatment plan delivery can be detected and distinguished by quantifying the agreement between the measured and the predicted portal dose (PD) images of a beam using a radiomics-based ML model in patient-specific QA. As is shown in Fig. 1, our data analysis pipeline is composed of four major steps: error simulation, image preprocessing, radiomic feature extraction, and model evaluation.
• Error simulation: Four types of machine-related errors were simulated for each beam of the selected IMRT plans. The PD images were acquired for the plans with and without introduced errors. The predicted PD images for the original plans were regarded as references.
• Image preprocessing: To quantify the effects of the simulated errors, the difference maps between the measured PD image and the predicted error-free PD image were calculated: three from the SSIM sub-indices and one from a DD-derived metric. The additional metric, a Gaussian-transformed DD, was introduced because of the reported insensitivity of the SSIM sub-indices to small absolute luminance changes.13,14 Thus, four difference maps were obtained for each measured PD image.
• Radiomic feature extraction: As the input of the classification model, radiomic features were extracted from each difference map. Since the radiomics analysis is typically performed within the regions of interest (ROIs), the ROI in each difference map was the jaw-defined field projection on the EPID.
• Feature selection method and classifiers: A recursive feature elimination (RFE) method coupled with a support vector machine (SVM), SVM-RFE, was used for the feature selection. We tested four classifiers, including a linear discriminant classifier (LDC), an SVM with linear function kernel (Linear-SVM), an SVM with radial basis function kernel (RBF-SVM), and a random forest (RF), for this classification task.
• Model evaluation: To select the optimal model, the classification accuracy of each model was assessed based on a nested cross validation (CV) scheme.
In the following parts, each step of the pipeline is described in detail. In this work, we chose the EPID as the dose image acquisition modality in consideration of both efficiency and reproducibility. Since the measurement and the calculation of the PD images are essential in the process of dataset creation, the EPID calibration procedure and the PD image prediction algorithm are introduced first.
2.A. Portal dosimetry
In this study, all PD measurements were acquired with a Varian aS1200 EPID, mounted on a Varian VitalBeam LINAC equipped with a Millennium 120 leaf MLC (Varian Medical Systems, Palo Alto, CA). This imager had an active area of 40 × 40 cm² with 1190 × 1190 pixel arrays and 0.336 mm pixel pitch. The new backscatter shielding design of this imager has been demonstrated to reduce the backscatter artifacts from the robotic support arm efficiently compared with the previous models.16 In addition, good linearity of the EPID dose response (≤ 0.4%) was found for this imager as well. Therefore, this imager can provide more accurate measurements to verify the pretreatment delivery.
According to the manufacturer's recommendation,17 the EPID was first calibrated by acquisition of a dark field (DF) image and a flood field (FF) image, which were used to eliminate the background noise and correct the pixel-to-pixel response differences, respectively, for a raw image. Then, the dosimetric calibration of the EPID was performed to evaluate the acquired PD images in Calibrated Units (CUs), where 1 CU was defined as the central pixel response for a 10 × 10 cm² field at a source-to-imager distance (SID) of 100 cm to 100 MU. Finally, a two-dimensional beam profile correction was performed to bring back the expected nonuniformity of the treatment beam, which was divided out by the FF correction.
The predicted PD images, with which the acquired PD images were compared, were calculated using the portal dosimetry image prediction (PDIP) algorithm.18,19 In this study, the EPID-based dosimetry verifications, including acquisition and prediction of the PD images, were all performed at SID of 100 cm.
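The calibration steps described above can be illustrated with a minimal Python sketch. It assumes a commonly used form of dark-field and flood-field correction (subtract the DF, divide by the DF-subtracted FF, rescale by the mean gain) and a simple scaling to CU; the vendor's actual processing is not given in the text, and the function names `flat_field_correct` and `to_calibrated_units` are hypothetical. The beam profile correction step is not included.

```python
def flat_field_correct(raw, dark, flood):
    """Assumed DF/FF correction: (raw - DF) * mean(FF - DF) / (FF - DF).

    Images are lists of rows. Pixels where flood == dark are not handled
    here and would raise ZeroDivisionError.
    """
    # Per-pixel gain map from the dark-subtracted flood field.
    gain = [[f - d for f, d in zip(f_row, d_row)]
            for f_row, d_row in zip(flood, dark)]
    pixel_count = sum(len(row) for row in gain)
    mean_gain = sum(sum(row) for row in gain) / pixel_count
    # Remove the background, then even out pixel-to-pixel response.
    return [[(r - d) * mean_gain / g for r, d, g in zip(r_row, d_row, g_row)]
            for r_row, d_row, g_row in zip(raw, dark, gain)]


def to_calibrated_units(image, central_response_100mu):
    """Scale an image so that the reference response equals 1 CU.

    central_response_100mu is assumed to be the corrected central-pixel
    response of the 10 x 10 cm^2 reference field delivered with 100 MU.
    """
    return [[p / central_response_100mu for p in row] for row in image]


if __name__ == "__main__":
    dark = [[2, 2], [2, 2]]
    flood = [[12, 22], [12, 22]]   # right column has twice the gain
    raw = [[12, 22], [12, 22]]     # uniform beam seen through that gain
    corrected = flat_field_correct(raw, dark, flood)
    print(corrected)
    print(to_calibrated_units(corrected, corrected[0][0]))
```

In this toy case the two detector columns differ in gain by a factor of two, and the correction returns a uniform image, which is the intended effect of the FF step.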
FIG 2. Structural SIMilarity sub-index maps (left three columns) and dose difference-derived maps (rightmost column) generated from measured portal dose
(PD) images with and without errors of a single beam, where the reference is the corresponding predicted PD image. From top to bottom row, the difference maps
correspond to PD images of error-free, random error of σ = 2 mm, shift error of 2 mm, opening error of 2 mm and MU error of +5%. Only the central patch of
size 20 × 20 cm² of each map is displayed. [Color figure can be viewed at wileyonlinelibrary.com]
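To make the three SSIM sub-indices concrete, the following Python sketch evaluates luminance, contrast, and structure for a single local window, using the standard SSIM definitions with the usual constants K1 = 0.01 and K2 = 0.03. It is an illustration under stated assumptions, not the authors' implementation: population statistics are used without Gaussian weighting, the window contents are passed as flat lists, and the function name `ssim_sub_indices` is hypothetical. A full sub-index map would repeat this computation in a sliding window over the image.

```python
import math


def ssim_sub_indices(x, y, dynamic_range, k1=0.01, k2=0.03):
    """Return (luminance, contrast, structure) for two equally sized windows."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    # Stabilizing constants of the standard SSIM definition.
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance, contrast, structure


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0, 4.0]
    scaled = [1.1 * v for v in reference]   # uniform 10% increase
    print(ssim_sub_indices(reference, reference, dynamic_range=4.0))
    print(ssim_sub_indices(reference, scaled, dynamic_range=4.0))
```

With a uniform scaling of the window, the luminance term drops below 1 while the structure term stays at 1 to within rounding, which matches the role the text assigns to the luminance sub-index for absolute dose changes.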
TABLE I. Radiomic features used in this study.
Categories | Features
First order statistics | 10Percentile, 90Percentile, energy, entropy, interquartile range, kurtosis, mean, mean absolute deviation, median, minimum, range, robust mean absolute deviation, root mean square, skewness, total energy, uniformity, variance
GLCMs | Autocorrelation, joint average, cluster prominence, cluster shade, cluster tendency, contrast, correlation, difference average, difference entropy, difference variance, joint energy, joint entropy, informational measure of correlation1, informational measure of correlation2, inverse difference moment, inverse difference moment normalized, inverse difference, inverse difference normalized, inverse variance, maximum probability, sum entropy, sum of squares
GLDM | Dependence entropy, dependence non-uniformity, dependence non-uniformity normalized, dependence variance, gray level non-uniformity, gray level variance, high gray level emphasis, large dependence emphasis, large dependence high gray level emphasis, large dependence low gray level emphasis, low gray level emphasis, small dependence emphasis, small dependence high gray level emphasis, small dependence low gray level emphasis
GLSZM | Gray level non-uniformity, gray level non-uniformity normalized, gray level variance, high gray level zone emphasis, large area emphasis, large area high gray level emphasis, large area low gray level emphasis, low gray level zone emphasis, size zone non-uniformity, size zone non-uniformity normalized, small area emphasis, small area high gray level emphasis, small area low gray level emphasis, zone entropy, zone percentage, zone variance
penalty for the error-free class was set twice as large as that for other classes, which resulted in heavier cost of misclassification for error-free cases in model fitting process. Same misclassification penalty settings were used in SVM-RFE for feature selection.
outer loop was performed 120 times so that the performance of each model was evaluated on 1080 different testing sets. For each model, the overall classification accuracy (OCA), defined as the ratio of the correctly classified cases over the total number of cases, and the normalized confusion matrix of each testing set were calculated. In the normalized confusion matrix, the element (i, j) represented the ratio of the cases in class i that were classified as being in class j over the total number of cases in class i. Accordingly, the diagonal elements represented the specific classification accuracies (SCAs) of the corresponding classes.
2.G. Comparison with gamma analysis
The models were compared to the gamma analysis.32 The gamma pass rate, defined as the percentage of dose points with a gamma value <1, was calculated to assess the agreement between each measured PD image to the corresponding predicted error-free PD image. The global DD/DTA criteria were set to 3%/3 mm and 2%/2 mm. Note that 3%/3 mm is the most commonly used DD/DTA value for gamma criteria33 and a stricter criterion, 2%/2 mm was adopted in this study for the purpose of obtaining high sensitivity of the gamma analysis in detecting errors. The pixel values below 20% of the maximum pixel value of the predicted PD image were ignored in the analysis. As commonly used in clinical practice, a 95% gamma pass rate under each criterion was considered acceptable for an IMRT plan. Therefore, the measured PD images of gamma pass rates greater than the threshold 95% were classified as error-free, otherwise the images were classified as erroneous. For brevity, this classification method is called gamma threshold method. To compare with the gamma threshold method, the cases of various error types classified by the former models were binned under one erroneous category.
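The gamma threshold method described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the per-pixel gamma values have already been computed by a separate gamma engine (not shown); it applies the 20% low-dose cutoff on the predicted image, the gamma < 1 pass condition, the 95% pass-rate threshold, and the binning of multiclass ML predictions into one erroneous category. All function and label names are hypothetical.

```python
def gamma_pass_rate(gamma_values, predicted_pixels, low_dose_fraction=0.2):
    """Percentage of analyzed pixels with gamma < 1.

    Pixels whose predicted value is below low_dose_fraction of the maximum
    predicted value are ignored, as stated in the text.
    """
    cutoff = low_dose_fraction * max(predicted_pixels)
    kept = [g for g, p in zip(gamma_values, predicted_pixels) if p >= cutoff]
    if not kept:
        raise ValueError("no pixels above the low-dose cutoff")
    return 100.0 * sum(1 for g in kept if g < 1.0) / len(kept)


def classify_by_gamma(pass_rate, threshold=95.0):
    """Error-free only if the pass rate is greater than the threshold."""
    return "error-free" if pass_rate > threshold else "erroneous"


def bin_error_types(label):
    """Collapse every multiclass error label into a single erroneous class."""
    return "error-free" if label == "error-free" else "erroneous"


if __name__ == "__main__":
    gamma = [0.5, 0.8, 1.2, 0.3, 2.0]
    predicted = [100, 90, 80, 50, 10]   # last pixel is below the 20% cutoff
    rate = gamma_pass_rate(gamma, predicted)
    print(rate, classify_by_gamma(rate))
    print([bin_error_types(c) for c in ["error-free", "MU", "shift"]])
```

Binning the ML predictions this way puts both methods on the same two-class footing, so their sensitivities to errors can be compared directly.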
feature categories. GLCMs stood out: its features accounted for more than half of the 20 features (11, 55% of total) and covered every difference map. The other three feature categories, GLDM, first order statistics, and GLSZM, contributed 4, 3, and 2 features, respectively. It is worth mentioning that there were 10 features selected for every training set. As revealed by Table II, these 10 features were distributed in every difference map and feature category. Therefore, all of the first-order statistics and texture feature categories involved in this study played important roles in establishing the classification models.
The impact of the selected number of features was investigated by varying the feature set size from 10 to 50 with a step size of 10 in the feature selection procedure. For each of the 1080 training sets, the features were ranked and the classification accuracy of each feature subset was estimated by an SVM classifier using a fivefold CV. Figure 4 illustrates the distributions of the classification accuracy with respect to the number of selected features. The mean accuracy (marked as a white square for each distribution) increased from 0.803 to 0.836 as the number of the selected features increased from 10 to 30, and then remained fairly stable for feature-set sizes greater than 30. Specifically, the mean accuracy at a feature-set size of 20 was 0.827, which was <1% lower and >2% higher than that at feature-set sizes of 30 and 10, respectively. In addition, the impact of the feature-set size was evaluated over the whole dataset and the highest classification accuracy was achieved using the top 22 ranked features. Hence, the feature-set size was reduced to 20 to avoid overfitting while preserving high classification accuracies.
TABLE II. The 10 features that were selected for every training set. The names of these ten features, as well as the categories and the difference map that they belonged to, are listed.
Category | Difference map | Feature name
First order | Contrast | Uniformity
GLSZM | Luminance | Zone entropy
GLSZM | DD-derived | Small area high gray level emphasis
GLCMs | Contrast | Correlation
GLCMs | Contrast | Informational measure of correlation1
GLCMs | Structure | Correlation
GLCMs | Luminance | Informational measure of correlation1
GLCMs | DD-derived | Cluster prominence
GLDM | Structure | Dependence variance
GLDM | DD-derived | Large dependence high gray level emphasis
FIG 3. Stacked bar plot summarizing the distribution of the 20 most frequently selected features on the difference map, with data labels showing the number of features in each feature category. The total counts and percentage of the features for each difference map are labeled on the top of the corresponding bar.
FIG 4. Distributions of the classification accuracy with respect to the number of selected features.
3.B. Comparison of the classification models
The OCAs of LDC, Linear-SVM, RBF-SVM, and RF using 120 times nested ninefold CV of the outer loop (1080 testing sets) were 0.83 ± 0.09, 0.86 ± 0.07, 0.86 ± 0.07, and 0.80 ± 0.12, respectively. Therefore, the two SVM models had the highest and comparable OCAs, and the LDC came next. RF exhibited the worst overall performance (lowest OCA).
To evaluate the performance of the models on predicting each error type, the normalized confusion matrix of each model was averaged over the 1080 testing sets (Fig. 5). For all of the models, higher mean SCAs were found on predicting the shift, opening, and random errors (0.87 ± 0.04) than on the MU errors and error-free (0.70 ± 0.12). Therefore, all of the four models performed better on discriminating the shift, opening, and random errors than the MU errors and error-free. Specifically, comparable performance was found between the two SVM models on identifying the erroneous cases: the difference of the mean SCA on each error type between the two models was within 1%. However, the mean SCA on error-free of the Linear-SVM model was 6% higher than that of the RBF-SVM model. Moreover, the
two SVM models had the best performance on discriminating the shift, opening, and random errors among all of the models. The LDC followed, with mean SCAs on the opening, random, and MU errors 3–4% lower than those of the Linear-SVM model. Identical mean SCAs were found on the shift errors and error-free between the LDC and the Linear-SVM models. Though the RF model had the highest mean SCA on MU errors (0.82), the lowest mean SCAs were found on error-free and the other three error types. Especially on error-free, the mean SCA of the RF model was only 0.58, which was much lower than that of the other three models. Overall, the RF model had the worst classification performance and the best performance was achieved by the Linear-SVM model.
To obtain some insights into the relationships within the data, the linear discriminant analysis (LDA) was applied to project the radiomic features onto the first two of the most discriminative features for visualization. As is presented in Fig. 6, the data were projected down to a two-dimensional scatter plot. In general, the errors of the same type were more likely to cluster together. Except for a slight overlap in the area where the smaller errors gathered, the shift, opening, and random errors were well separated into distinct groups, which means that the extracted radiomic features of these three error types were highly distinguishable. However, a large portion of the group formed by MU errors overlapped with the group of error-free. This could partially explain the confusion between the error-free and the MU error classes.
3.C. Influence of difference analysis and misclassification penalty
As we mentioned in Section 2.B.4, the purpose of combining the DD analysis with the SSIM analysis in comparing the measured and the predicted PD images was to compensate for the insensitivity of the luminance index to the absolute CU changes. In order to see how the DD-derived map could affect the classification performance, the best performing model, Linear-SVM, was trained and evaluated on the dataset without the radiomic features from the DD-derived maps. The aforementioned nested CV scheme was used for model evaluation and the normalized confusion matrix is illustrated in Fig. 7. Over 48% of the error-free and MU errors were misclassified as each other, which means that the model was not capable of distinguishing error-free and MU errors without the features from the DD-derived maps. Moreover, the
FIG 5. The normalized confusion matrices of linear discriminant classifier (top left), Linear-support vector machine (SVM) (top right), RBF-SVM (bottom left), and RF (bottom right) averaged over 1080 testing sets. [Color figure can be viewed at wileyonlinelibrary.com]
TABLE III. Normalized confusion matrices for the two-class (error-free and any error) classification task by gamma threshold method applied under the criteria 3%/3 mm and 2%/2 mm, and an ML model.
Gamma threshold (3%/3 mm) | Gamma threshold (2%/2 mm) | ML model (Linear-SVM)
FIG 8. Distributions of the gamma pass rate under the 2%/2 mm criterion with respect to the error type. The dashed line represents the 95% gamma pass rate.
4. DISCUSSION
In this study, we applied ML models to address the detection and classification problems of the machine-related errors based on the radiomic features obtained from the SSIM and DD analysis of the EPID measurements in patient-specific dynamic IMRT QA. The results demonstrated that the proposed method was capable of classifying the measured PD images in terms of leaf positioning and machine output errors. Compared with the pilot studies of Wootton et al. and Nyflot et al.,10,11 this work incorporated additional error types, including leaf opening and machine output errors, and different error magnitudes for error identification. The best achieved average OCA was 0.86 in our work, significantly higher than the previously reported average OCA (0.643) for multiclass error identification. The accuracy improvement may result from the feature engineering in this study. Compared with the DD-DTA-blended gamma maps, the four difference maps could separately reflect the different error patterns resulting from various beam delivery errors, and hence more informative features could be preserved. For example, positioning errors (i.e., MLC errors, EPID misalignment) can be clearly reflected in contrast maps or DTA maps.34 In this work, the measured PD images were directly compared to the calculated PD images for difference map calculation, accounting for the existing discrepancy between the PDIP-modeled images and acquired PD images. The results obtained in the current work further supported the feasibility of such error detection methods in clinical environments. To our knowledge, this is the first application of ML models with SSIM analysis to classify the patient-specific dynamic IMRT QA results.
The radiomic features, including the first order statistics and the texture features, were captured from each of the difference maps. The first order statistical features quantified the distribution of the pixel values, while the texture features calculated from the GLCMs, GLSZM, and GLDM quantified the inter-pixel relationships within each map. The results showed that all of the radiomic feature categories and the difference maps made important contributions to the classification of the delivery errors contained in the measured PD images. In particular, high accuracies (~0.9) were achieved on the classification of the random, shift, and opening errors.
By the nature of the three error types, the shift and opening errors would be more likely to introduce strip-like dose changes perpendicular to the MLC motion directions, whereas the dose changes induced by random leaf errors tended to be small localized clusters.10 The resulting absolute dose and dose gradient variation could be reflected by the luminance and the contrast index maps (Fig. 2). Especially in the area of large dose gradient, informative features could be provided by these two difference maps on identifying the shift, opening, and random errors. In addition, we observed that the variance of the dose-changing trend brought by random leaf errors could be reflected on the structure index maps as well. Most notably, the ability of the model to distinguish error-free and the MU errors was dramatically improved by employing additional features from the DD-derived map in the model training process. In the meantime, the information of the pixelwise dose difference provided by the DD-derived map further enhanced the performance of the model in identifying the three types of the MLC positioning errors. Though implementing a deep learning approach in EPID image classification has been shown to be feasible for extracting discriminative features,11 the lack of interpretability of the deep learning features makes them difficult to associate with the visual patterns in the images. Thus, the empirical feature approach employed in this study gained insights into the contributions of the radiomic features of the four difference maps to classification decisions. Furthermore, this empirical feature approach remained practical for the relatively small datasets, since curating a large library of EPID QA images with annotated error types remains challenging.
Although a promising OCA of ~0.86 was achieved by the two SVM models, the mean SCAs of these models on the error-free and the MU error cases were relatively low (≤ 0.8) compared with those of the other three error types. Since verification of the machine output was performed before each measurement session and the error was found within 0.5%, the influence of the output variation on the measurements can be neglected. Moreover, the impact of the EPID response to MU changes can be ignored as well since good linearity (≤ 0.4%) of the aS1200 EPID model, which was used in this study, has been demonstrated in the previous study.16 Therefore, the influence of the variation from the measurements on discriminating error-free and the MU errors can be ruled out. The main cause of this problem could be the accuracy of the PDIP algorithm, since the MLC transmitted radiation was not modeled accurately enough in this algorithm.18,35 Large relative difference between the measured and the predicted PD images of a dynamic IMRT field was found in low dose regions, owing to the large amount of transmitted radiation through the MLC leaves during the beam delivery. The DD-derived map was initially designed to evaluate relative local differences, but intensified the differences in the low dose regions, as shown in Fig. 2. Furthermore, the transmission radiation is not completely linear to the machine output, due to MLC leaf modulations. Thus, the radiomic features of the error-free images and the images of MU errors for some fields can be indistinguishable for the ML models.
The aforementioned problem can be addressed from the following aspects to further enhance the classification performance of the ML models in future studies. First of all, multiple regions based on the relative dose level can be contoured, so that the features extracted from different regions would comprise dose level information. Secondly, a correction factor can be used to account for the MLC transmission in order to improve the agreement between the measured and the predicted PD images.35 Additionally, tuning the standard deviation of the Gaussian distribution in DD analysis may generate more discriminative patterns in DD-derived maps for the error-free and the MU errors. Moreover, applying a CNN for feature extraction may capture more informative features from the difference maps.
Although the 3%/3 mm criterion is most commonly used in the standard gamma analysis for IMRT QA, the poor sensitivity of this criterion in detecting errors has been demonstrated in this study and by others.11 By contrast, the stricter criterion, 2%/2 mm, exhibited dramatically improved error detectability. However, the detected errors could not be distinguished owing to the highly overlapped gamma pass rate intervals of the error types. Therefore, the proposed radiomics-based ML models have great potential for assisting the gamma analysis for error identification in the IMRT QA procedure.
A major concern about the present work is the relatively small dataset. Models trained on a small number of observations tend to produce overfitted results. To mitigate the risk of overfitting, two strategies were adopted: one was limiting the feature-set size in model training and the other one was using a nested CV scheme in model evaluation. We only selected the top 20 ranked features for our sample size of 180 beams, which obeyed the "one in ten" rule of thumb for the number of predictive variables estimated from data when doing regression analysis.36 Also, the feature-set size of 20 in this work was indicated reasonable to achieve
distinguishing the minority class (class of no error) from the majority class (class of MU errors). Nevertheless, the use of the inverse weighting from the dataset is not suited for every classifier, such as the RF classifier, where a large portion of the error-free cases were still misclassified as the cases of MU errors. Thus, an adaptive weighting approach may help find a group of optimal weights for the classes and achieve better performance.38 Besides, the approaches of resampling the dataset, for example, undersampling of the majority classes or oversampling of the minority classes,39,40 can be used to cope with the class imbalance problem.
There were limitations of this study that should be noted. We only applied this method to detect the machine-related errors in patient-specific QA for dynamic IMRT plans. The feasibility of implementing the present methodology to VMAT QA needs further exploration. We believe that similar strip-like and cluster-like patterns would be brought by the systematic and the random MLC errors, respectively, to the SSIM sub-index maps for VMAT QA results, which appear to be discriminative to these errors. Moreover, we only simulated four types of machine-related errors to investigate the feasibility of the present methodology. However, there are other error sources in beam delivery, such as collimator rotation, beam flatness, and symmetry, as well as gantry rotation in the delivery of VMAT plans. Including additional error types and expanding the proposed method to VMAT QA will be part of our future work.
The ML models were developed for relatively high-resolution EPID dosimetry in this study. However, the feasibility of implementing the proposed method to detectors of lower resolution (e.g. diode arrays) is unclear. Especially for VMAT QA, dosimetry systems such as ArcCheck (Sun Nuclear, Melbourne, FL)41 and Delta4 system (ScandiDos AB, Uppsala, Sweden)42 are often utilized to identify errors in the integral dose distributions. It has been reported that SSIM analysis
promising model performance (Fig. 7). Unlike the conven- could be applied to the dose maps measured by detectors of
tional K-fold CV which uses the same data for hyperparam- lower resolutions by adjusting the local window size, which
eter tuning and model evaluation, the nested CV scheme determines the area used to calculate the local statistics in
used a series of splits of the whole dataset which allowed SSIM analysis.43 However, the impact on the radiomic fea-
one portion of each split for hyperparameter tuning and tures brought by the change of image resolution is currently
model training, and the remaining portion for model evalu- unknown. Further exploration is required to investigate the
ation. Especially for small dataset, the separation of the impact of the detector’s reduced resolution on the perfor-
data for hypermeter tuning and model evaluation in this mance of this method.
validation strategy effectively avoid overfitting caused by In this study, we chose to demonstrate error classification
information leakage.37 performance on the integral PD images per fraction (i.e.,
This study evaluated one feature selection strategy and inter-fraction QA). Another approach would be performing
four commonly used ML classifiers to find the optimal com- error classifications on the acquired image frames during the
bination for the five-class classification task. The classifica- delivery (i.e., intra-fraction), which would be advantageous
tion results showed that the Linear-SVM combined with the to identify certain types of errors, for example, MLC leaf
feature selector RFE-SVM had the best classification perfor- errors and MU errors. However, the integral influence from
mance. However, further research can be performed on the errors of all image frames would pose a challenge. In addi-
exploration of other feature selection methods and classifiers tion, incorporating this method into the error classification of
to achieve better performance. It is worth noting that the mis- an in vivo QA process would be of great benefit, since treat-
classification penalty we applied in model fitting process ment errors are often introduced by other factors rather than
effectively mitigated the bias resulting from uneven class dis- the machines (e.g. positioning errors, anatomical changes,
tributions and helped producing high accuracy in etc.).
5. CONCLUSIONS

We proposed an ML-based method for machine-related error identification in patient-specific dynamic IMRT QA, where radiomics analysis on SSIM sub-index maps was used for feature extraction. High error classification accuracies were achieved in IMRT QA using this method, and superior sensitivity in detecting errors has been demonstrated in contrast with the traditional gamma threshold method. This method has great potential to assist the conventional gamma analysis for error identification in the IMRT QA process.

ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China (2019YFF01014405), National Natural Science Foundation of China (No. 11505012), Beijing Municipal Administration of Hospitals Incubating Program (No. PX2019042), Ministry of Education Science and Technology Development Center (No. 2018A01019) and Natural Science Foundation of Beijing (No. 1202009).

CONFLICT OF INTEREST

The authors have no conflict to disclose.

† These authors contributed equally to this work.
a) Author to whom correspondence should be addressed. Electronic mail: [email protected].

REFERENCES

1. Ezzell GA, Galvin JM, Low D, et al. Guidance document on delivery, treatment planning, and clinical implementation of IMRT: report of the IMRT subcommittee of the AAPM radiation therapy committee. Med Phys. 2003;30:2089–2115.
2. Miften M, Olch A, Mihailidis D, et al. Tolerance limits and methodologies for IMRT measurement-based verification QA: recommendations of AAPM Task Group No. 218. Med Phys. 2018;45:e53–e83.
3. Zhu XR, Jursinic PA, Grimm DF, et al. Evaluation of kodak EDR2 film for dose verification of intensity modulated radiation therapy delivered by a static multileaf collimator. Med Phys. 2002;29:1687–1692.
4. Jursinic PA, Nelms BE. A 2-D diode array and analysis software for verification of intensity modulated radiation therapy delivery. Med Phys. 2003;30:870–879.
5. Wu C, Hosier KE, Beck KE, et al. On using 3D γ-analysis for IMRT and VMAT pretreatment plan QA. Med Phys. 2012;39:3051–3059.
6. Low DA, Moran JM, Dempsey JF, et al. Dosimetry tools and techniques for IMRT. Med Phys. 2011;38:1313–1338.
7. Nelms BE, Zhen H, Tomé WA. Per-beam, planar IMRT QA passing rates do not predict clinically relevant patient dose errors. Med Phys. 2011;38:1037–1044.
8. Kruse JJ. On the insensitivity of single field planar dosimetry to IMRT inaccuracies. Med Phys. 2010;37:2516–2524.
9. Kry SF, Molineu A, Kerns JR, et al. Institutional patient-specific IMRT QA does not predict unacceptable plan delivery. Int J Radiat Oncol Biol Phys. 2014;90:1195–1201.
10. Wootton LS, Nyflot MJ, Chaovalitwongse WA, et al. Error detection in intensity-modulated radiation therapy quality assurance using radiomic analysis of gamma distributions. Int J Radiat Oncol Biol Phys. 2018;102:219–228.
11. Nyflot MJ, Thammasorn P, Wootton LS, et al. Deep learning for patient-specific quality assurance: Identifying errors in radiotherapy delivery by radiomic analysis of gamma images with convolutional neural networks. Med Phys. 2019;46:456–464.
12. Kimura Y, Kadoya N, Tomori S, et al. Error detection using a convolutional neural network with dose difference maps in patient-specific quality assurance for volumetric modulated arc therapy. Phys Medica. 2020;73:57–64.
13. Wang Z, Bovik AC, Sheikh HR, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process. 2004;13:600–612.
14. Peng J, Shi C, Laugeman E, et al. Implementation of the structural SIMilarity (SSIM) index as a quantitative evaluation tool for dose distribution error detection. Med Phys. 2020;47:1907–1919.
15. Bawazeer O, Sarasanandarajah S, Sisira Herath TK, et al. Sensitivity of electronic portal imaging device (EPID) based transit dosimetry to detect inter-fraction patient variations. Springer Singapore; 2019.
16. Miri N, Keller P, Zwan BJ, et al. EPID-based dosimetry to verify IMRT planar dose distribution for the aS1200 EPID and FFF beams. J Appl Clin Med Phys. 2016;17:292–304.
17. Varian Medical Systems. CTB PV: Installation and Verification of the Portal Dosimetry; 2012:1–40.
18. Van Esch A, Depuydt T, Huyskens DP. The use of an aSi-based EPID for routine absolute dosimetric pre-treatment verification of dynamic IMRT fields. Radiother Oncol. 2004;71:223–234.
19. Van Esch A, Huyskens DP, Hirschi L, et al. Optimized varian aSi portal dosimetry: development of datasets for collective use. J Appl Clin Med Phys. 2013;14:82–99.
20. Younge KC, Roberts D, Janes LA, et al. Predicting deliverability of volumetric-modulated arc therapy (VMAT) plans using aperture complexity analysis. J Appl Clin Med Phys. 2016;17:124–131.
21. Carlone M, Cruje C, Rangel A, et al. ROC analysis in patient specific quality assurance. Med Phys. 2013;40:1–7.
22. Rangel A, Dunscombe P. Tolerances on MLC leaf position accuracy for IMRT delivery with a dynamic MLC. Med Phys. 2009;36:3304–3309.
23. Dische S, Saunders MI, Williams C, et al. Precision in reporting the dose given in a course of radiotherapy. Radiother Oncol. 1993;29:287–293.
24. Van Griethuysen JJM, Fedorov A, Parmar C, et al. Computational radiomics system to decode the radiographic phenotype. Cancer Res. 2017;77:e104–e107.
25. Zwanenburg A, Vallières M, Abdalah MA, et al. The image biomarker standardization initiative: standardized quantitative radiomics for high-throughput image-based phenotyping. Radiology. 2020;295:328–338.
26. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. New York: Springer; 2009.
27. Guyon I, Weston J, Stephen B. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46:389–422.
28. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
29. Ledoit O, Wolf M. Honey, I shrunk the sample covariance matrix. J Portf Manag. 2004;30:110–119.
30. Wainer J, Cawley G. Nested cross-validation when selecting classifiers is overzealous for most practical applications; 2018.
31. Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinform. 2006;7:91.
32. Low DA, Harms WB, Mutic S, et al. A technique for the quantitative evaluation of dose distributions. Med Phys. 1998;25:656–661.
33. Pan Y, Yang R, Zhang S, et al. National survey of patient specific IMRT quality assurance in China. Radiat Oncol. 2019;14:1–10.
34. Potter NJ, Mund K, Andreozzi JM, Li JG, Liu C, Yan G. Error detection and classification in patient-specific IMRT QA with dual neural networks. Med Phys. 2020;352:4711–4720.
35. Vial P, Greer PB, Hunt P, et al. The impact of MLC transmitted radiation on EPID dosimetry for dynamic MLC beams. Med Phys. 2008;35:1267–1277.
36. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15:361–387.
37. Cawley GC, Talbot NLC. On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res. 2010;11:2079–2107.
38. Huang W, Song G, Li M, Hu W, Xie K. Adaptive weight optimization for classification of imbalanced data. Lect Notes Comput Sci. 2013;8261:546–553.
39. Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–357.
40. Liu XY, Wu J, Zhou ZH. Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern Part B Cybern. 2009;39:539–550.
41. Létourneau D, Publicover J, Kozelka J, et al. Novel dosimetric phantom for quality assurance of volumetric modulated arc therapy. Med Phys. 2009;36:1813–1821.
42. Bedford JL, Lee YK, Wai P, et al. Evaluation of the Delta4 phantom for IMRT and VMAT verification. Phys Med Biol. 2009;54:N167–N176.
43. Shi C, Lim S, Chan M. Evaluation of a Transmission Detector on IMRT QA Using Structure Similarity Index (SSIM), poster presented at: AAPM annual meeting 2018, ePoster ID: TU-C1030-GePD-F6-3.