An Enhanced Monte Carlo Outlier Detection Method
Outlier detection is crucial in building a highly predictive model. In this study, we proposed an enhanced Monte Carlo outlier detection method by establishing cross-prediction models based on determinate normal samples and analyzing the distribution of prediction errors individually for dubious samples. One simulated and three real datasets were used to illustrate and validate the performance of our method, and the results indicated that this method outperformed Monte Carlo outlier detection in outlier diagnosis. After these outliers were removed, the value of validation by Kovats retention indices and the root mean square error of prediction decreased from 3.195 to 1.655, and the average cross-validation prediction error decreased from 2.0341 to 1.2780. This method helps establish a good model by eliminating outliers. © 2015 Wiley Periodicals, Inc.
DOI: 10.1002/jcc.24026
MC outlier detection
Figure 2. Mean/STD plot of prediction errors for Dataset 1: Enhanced Monte Carlo outlier detection (left) and Monte Carlo outlier detection (right).
select 40–60% of the samples with the smallest MV and STD of prediction errors, and determine the remaining samples as dubious samples; (3) randomly divide the selected data (Ns) into training and validation sets; (4) after the number of principal components is determined by cross-validation, build the prediction model with the training set and use it to predict the samples in the dubious samples in the validation set to obtain the prediction error; (5) after N cycles, obtain the prediction error distribution for the dubious samples; and (6) use the MV and STD of the error distribution on the dubious samples to test whether the dubious samples are outliers. According to the hypothesis of this outlier detection method, the MVs and STDs of the prediction errors of the normal samples decrease, while those of the outliers increase to some extent. As the masking effect could be eliminated by EMCOD, it could provide better results than MCOD.
Data processing and analysis
All programs used were coded in MATLAB 2011a for Windows, and all calculations were performed on a personal computer. The MATLAB implementation of EMCOD is available from http://www.mathworks.com/matlabcentral/fileexchange/52023-emcod.
Results and Discussion
Enhanced Monte Carlo outlier detection method
Outlier detection is an important step in building a highly predictive model. MCOD was recently developed to provide a feasible means of detecting different kinds of outliers by establishing many predictive models and an MV/STD plot of prediction errors for all samples. This outlier detection method relies on the MV/STD plot, so the key is to determine the visual boundary between normal and abnormal samples.
To illustrate our method, a simulated dataset was designed, which contains 100 normal samples, 20 X outliers, and 20 y outliers. MCOD was initially conducted to detect the outliers. As shown in Figure 2, the two kinds of outliers have a clear tendency to separate from the normal samples. The y outliers have larger prediction errors than normal samples, while X outliers (good leverage points) have larger STD values than normal samples. However, the boundary between the outliers and the normal samples is indistinct, making it difficult to determine whether a sample far from the origin is an outlier or not. In this case, EMCOD was performed to detect the outliers in this simulated dataset. As an enhancement of MCOD, the MVs and STDs of prediction errors acquired from MCOD were used to select the 60 normal samples with the smallest MVs
and STDs. When the number (N) of Monte Carlo models and the sampling ratio are set to 10,000 and 0.8, respectively, the MVs and STDs of prediction errors can be used to determine whether the dubious samples are outliers. As shown in Figure 2, the samples in this simulated dataset were clearly separated into four groups. The distances between outliers and normal samples increase significantly, and the 20 X outliers and 20 y outliers can be easily identified from the MV/STD plot of prediction errors. EMCOD clearly gives a better result in correctly detecting outliers.
Method validation
Dataset 2 is the stack loss dataset of a plant. In MCOD, the number (N) of Monte Carlo models and the sampling ratio are set to 10,000 and 0.8, respectively. The MV/STD plot of the prediction errors for the 21 samples is shown on the right of Figure 3. Without further information about this commonly used dataset, it is hard to determine the boundary for outlier detection. To obtain a clearer result, EMCOD was used to detect outliers in this dataset. As shown in Figure 3, samples 20, 5, 16, 18, 19, 13, 14, 8, 15, 10, and 17, which had the smallest mean and STD values, were normal samples. We established MC prediction models using these 11 samples and used these models to examine the other samples. The number (N) of Monte Carlo models and the sampling ratio are again set to 10,000 and 0.8, respectively. According to the hypothesis that models built with only normal samples provide lower prediction errors for normal samples but higher prediction errors for outliers, the distances between normal samples and outliers should be larger. The result is shown in Figure 3 (left), which illustrates that EMCOD gives a better result, as the outliers have been correctly detected.
Figure 3. Mean/STD plot of prediction errors for Dataset 2: Enhanced Monte Carlo outlier detection (left) and Monte Carlo outlier detection (right).
With the help of MC outlier detection, normal samples with the smallest MVs and STDs of prediction errors could be easily detected, even though it was hard to determine the boundary between normal samples and outliers. We selected some normal samples with the smallest MVs and STDs of prediction errors and then determined, one after another, whether the other samples were outliers.
Figure 4. Mean/STD plot of prediction errors for Dataset 3: Enhanced Monte Carlo outlier detection (left) and Monte Carlo outlier detection (right). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Dataset 3 represents the Hawkins–Bradu–Kass data. As shown on the right of Figure 4, the MV/STD plot indicates that 14 samples (No. 1–14) are outliers. 52 samples with the lowest STDs of prediction errors (<0.5) were selected as normal
Figure 5. Mean/STD plot of prediction errors for Dataset 4. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Keywords: outlier detection, enhanced Monte Carlo outlier detection, validation
Received: 18 June 2015
Accepted: 1 July 2015
Published online on 31 July 2015
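The EMCOD procedure described in this section (an initial MCOD pass, selection of the samples with the smallest MV and STD as normal, and repeated prediction of the dubious samples by models built only on normal samples) can be sketched in Python as follows. This is a minimal illustration under several assumptions and is not the authors' MATLAB implementation. A single-predictor ordinary least squares fit stands in for the cross-validated multivariate model, the prediction error is taken as an absolute value, samples are ranked by their distance from the origin of the MV/STD plane, and 200 models are used instead of 10,000 to keep the run short. The names `fit_line`, `error_stats`, and `emcod` are hypothetical.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (assumed stand-in for the paper's model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((v - mx) ** 2 for v in xs)
    sxy = sum((v - mx) * (w - my) for v, w in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def error_stats(x, y, train_pool, targets, n_models, ratio, seed):
    """Return {index: (MV, STD)} of absolute prediction errors for `targets`.

    Each model is fitted on a random fraction `ratio` of `train_pool`;
    a target is only predicted by models that did not train on it.
    """
    rng = random.Random(seed)
    errors = {i: [] for i in targets}
    n_train = max(2, int(ratio * len(train_pool)))
    for _ in range(n_models):
        train = rng.sample(train_pool, n_train)
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        used = set(train)
        for i in targets:
            if i not in used:
                errors[i].append(abs(y[i] - (a + b * x[i])))
    return {i: (statistics.fmean(e), statistics.pstdev(e))
            for i, e in errors.items() if e}


def emcod(x, y, keep_fraction=0.5, n_models=200, ratio=0.8, seed=0):
    """Two-pass sketch: MCOD on all samples, then normal-only models on dubious samples."""
    everyone = list(range(len(x)))
    # Pass 1 (MCOD): error statistics for every sample.
    first = error_stats(x, y, everyone, everyone, n_models, ratio, seed)
    # Keep the samples closest to the origin of the MV/STD plane as "normal"
    # (assumed ranking rule); the remaining samples are dubious.
    ranked = sorted(first, key=lambda i: math.hypot(*first[i]))
    n_keep = int(keep_fraction * len(ranked))
    normal, dubious = ranked[:n_keep], ranked[n_keep:]
    # Pass 2: models trained only on normal samples predict each dubious sample.
    second = error_stats(x, y, normal, dubious, n_models, ratio, seed + 1)
    return normal, second


# Toy data: a noisy straight line with one planted y outlier at index 7.
noise = random.Random(1)
x = [float(i) for i in range(30)]
y = [2.0 * xi + 1.0 + noise.gauss(0.0, 0.3) for xi in x]
y[7] += 50.0

normal, stats = emcod(x, y)
# Show the three dubious samples with the largest mean prediction error.
for i in sorted(stats, key=lambda i: stats[i][0], reverse=True)[:3]:
    print(i, stats[i])
```

On this toy data the planted sample ends up in the dubious set with by far the largest MV. The sketch deliberately stops at the MV and STD values, since the text decides outlier status from the MV/STD plot rather than from a fixed threshold.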