
Computers and Geosciences 125 (2019) 9-18

The document discusses using a Gaussian mixture model to separate geochemical anomalies from sample data of unknown distribution. It compares using a GMM to a one-class support vector machine on geochemical survey data from China. The GMM took less time to run and performed comparably to the SVM in detecting anomalies and known mineral deposits.


Computers and Geosciences 125 (2019) 9–18

Contents lists available at ScienceDirect

Computers and Geosciences


journal homepage: www.elsevier.com/locate/cageo

Separation of geochemical anomalies from the sample data of unknown distribution population using Gaussian mixture model

Yongliang Chen a,∗,1, Wei Wu b,2

a Institute of Mineral Resources Prognosis on Synthetic Information, Jilin University, Changchun, Jilin Province, 130026, PR China
b Changchun Institute of Urban Planning and Design, Changchun, Jilin Province, 130033, PR China

ARTICLE INFO

Keywords: Geochemical anomaly; Gaussian mixture model; One-class support vector machine; Receiver operating characteristic; Area under the curve; Youden index

ABSTRACT

The separation of geochemical anomalies from the sample data of unknown distribution population is a great challenge, as it is difficult to determine the correct model for the unknown population distribution. A Gaussian mixture model is a linear combination of several Gaussians. By using a sufficient number of Gaussians and by adjusting parameters, the model can generate a very complex probability density, which can approximate almost any continuous probability. Therefore, the Gaussian mixture model can fit the sample data of unknown distribution population, and those data points that do not conform to the model are considered as anomalies. The method was used to separate multivariate anomalies from the geochemical survey data of 1:200,000 scale collected from the Baishan district, Jilin Province, China, and compared with one-class support vector machine. The programs running the two models took 18.67 and 32.14 s, respectively; the receiver operating characteristic curves of the two models intersect each other in the ROC space; and the areas under the curves of the two models are 0.851 and 0.855, respectively. The “best” threshold determined by using the Youden index was used to separate geochemical anomalies. The anomalies separated from the modeling results of the two models occupy, respectively, 14.46% and 14.49% of the study area and contain, respectively, 83% and 70% of the known mineral deposits. Therefore, the Gaussian mixture model is comparable to one-class support vector machine in geochemical anomaly detection. It can be used as a geochemical anomaly detector with high performance and data modeling efficiency.

1. Introduction

Gaussian mixture model (GMM) is a linear combination of several Gaussians. By using a sufficient number of Gaussians and by adjusting their means and covariances as well as the coefficients in the linear combination, GMM can fit almost any continuous probability (Bishop, 2006). Given the sample data drawn from an unknown complex distribution, the maximum likelihood parameters of the GMM can be determined by the expectation maximization (EM) algorithm (Dempster et al., 1997; Lindstrom and Bates, 1988; van Dyk, 2000). GMMs have been applied in data classification, image segmentation, target discrimination, and novelty detection (Drews-Jr., 2013; Huang and Chau, 2008; Khanmohammadi and Chou, 2016; Kim and Kang, 2007; Li et al., 2016; Simms et al., 2018).

In geochemical exploration, one may assume that geochemical data points are drawn independently and randomly from an unknown population distribution. A GMM may then be used to model the geochemical data for expressing the background, and those data points that do not conform to the model are separated as anomalies. As a demonstration, GMM was used to separate geochemical anomalies from the stream sediment survey data of 1:200,000 scale collected from the Baishan district, Jilin Province, China and compared with one-class support vector machine (OCSVM) (Chen and Wu, 2017b, c). The program run time (PRT), receiver operating characteristic (ROC) and area under the curve (AUC) (Chen, 2015; Chen and Wu, 2016, 2017a, b, c) were used to evaluate the data-modeling efficiencies and performances of GMM and OCSVM in geochemical anomaly detection. The main contribution of this paper is that a GMM-based anomaly detector with high performance and data-modeling efficiency is developed for separating geochemical anomalies from the sample data of unknown distribution population.


Corresponding author.
E-mail address: [email protected] (Y. Chen).
1 Geochemical anomaly detection using GMM and OCSVM and manuscript writing.
2 Geochemical data preprocessing and thematic map generating using Surfer and Grapher.

https://doi.org/10.1016/j.cageo.2019.01.010
Received 2 August 2018; Received in revised form 4 November 2018; Accepted 14 January 2019
Available online 18 January 2019
0098-3004/ © 2019 Elsevier Ltd. All rights reserved.
Y. Chen, W. Wu Computers and Geosciences 125 (2019) 9–18

2. GMM-based geochemical anomaly detector

An n × m matrix X is used to express the geochemical data set $\{x_1, x_2, \dots, x_n\}$, with the ith data point $x_i = (x_{i1}, x_{i2}, \dots, x_{im})^T$. The entry $x_{ij}$ (i = 1, 2, …, n; j = 1, 2, …, m) represents the observed value of element j on data point i. The geochemical data need to be standardized so that each element has zero mean and unit standard deviation. The standardized value of element j is expressed as

$$x'_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}, \quad (j = 1, 2, \dots, m) \tag{1}$$

where $x_{ij}$ and $x'_{ij}$ are, respectively, the observed and standardized values of element j; and $\bar{x}_j$ and $\sigma_j$ are, respectively, the mean and standard deviation of element j.

Assuming that the n data points are drawn independently and randomly from an unknown distribution, the data distribution can be represented by the following GMM:

$$p(x) = \sum_{l=1}^{L} \pi_l \, g(x \mid \mu_l, \Sigma_l) \tag{2}$$

where $p(x)$ is the marginal probability; L is the number of Gaussians; $\pi_l$ is the lth mixing coefficient that satisfies $0 \le \pi_l \le 1$ and $\sum_{l=1}^{L} \pi_l = 1$; and $g(x \mid \mu_l, \Sigma_l)$ is the Gaussian density called the lth component and has its own mean $\mu_l$ and covariance $\Sigma_l$. The mixing coefficient $\pi_l$ can be viewed as the prior probability of picking the lth component, and the density $g(x \mid \mu_l, \Sigma_l)$ can be viewed as the probability of data point x conditioned on the lth component (Bishop, 2006).

The notations π = {π1, …, πL}, μ = {μ1, …, μL} and Σ = {Σ1, …, ΣL} are used to represent all the parameters of the GMM of Eq. (2). These parameters can be determined by maximizing the following logarithmic likelihood:

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n} \ln \left\{ \sum_{l=1}^{L} \pi_l \, g(x_i \mid \mu_l, \Sigma_l) \right\} \tag{3}$$

where X is the known geochemical data set.

Eq. (3) involves the unknown parameters π, μ, Σ as well as the unobserved latent variables that determine the component from which the data points originate. Finding a maximum likelihood solution of Eq. (3) leads to a set of unsolvable equations in which the solution to the latent variables needs the known parameter values and vice versa. Fortunately, this problem can be solved by an EM algorithm (Lindstrom and Bates, 1988; van Dyk, 2000), but the algorithm cannot guarantee finding the global maximum (Wu, 1983).

Let us use an n × L matrix $Z = (z_{il})_{n \times L}$ to express unobserved data of the L latent variables. The entry $z_{il}$ satisfies $z_{il} \in \{0, 1\}$ and $\sum_{l=1}^{L} z_{il} = 1$. When the data point $x_i$ originates from component l, $z_{il} = 1$, otherwise $z_{il} = 0$. Using the known parameters π, μ and Σ, Bishop (2006) proved the following relationship between $z_l = 1$ and $\pi_l$:

$$p(z_l = 1) = \pi_l, \quad (l = 1, 2, \dots, L) \tag{4}$$

where $p(z_l = 1)$ is the prior probability of $z_l = 1$, and $\pi_l$ is the lth mixing coefficient.

According to Bayes' theorem, we can use the known parameters π, μ and Σ to express the conditional probability of $z_l = 1$ given $x_i$ as follows:

$$p(z_l = 1 \mid x_i) = \frac{\pi_l \, g(x_i \mid \mu_l, \Sigma_l)}{\sum_{k=1}^{L} \pi_k \, g(x_i \mid \mu_k, \Sigma_k)}, \quad (l = 1, 2, \dots, L;\ i = 1, 2, \dots, n) \tag{5}$$

where $p(z_l = 1 \mid x_i)$ is the posterior probability of $z_l = 1$ given $x_i$. It can be viewed as the responsibility that the lth component takes for ‘explaining’ $x_i$ (Bishop, 2006).

The n × L responsibilities can be computed by Eq. (5) and used to solve the maximum likelihood solution for the parameters π, μ and Σ. According to Bishop (2006), the effective number of data points drawn from component l and the lth mixing coefficient can be, respectively, calculated by

$$n_l = \sum_{i=1}^{n} p(z_l = 1 \mid x_i) \tag{6}$$

and

$$\pi_l = \frac{1}{n} \sum_{i=1}^{n} p(z_l = 1 \mid x_i), \quad (l = 1, 2, \dots, L) \tag{7}$$

where $n_l$ is the effective number of data points drawn from component l; $\pi_l$ is the lth mixing coefficient; and $p(z_l = 1 \mid x_i)$ is the responsibility that can be computed by Eq. (5).

Bishop (2006) showed that by setting the derivatives of the logarithmic likelihood in Eq. (3) with respect to the means $\mu_l$ to zero, the following maximum likelihood solution for $\mu_l$ is obtained:

$$\mu_l = \frac{1}{n_l} \sum_{i=1}^{n} p(z_l = 1 \mid x_i) \, x_i, \quad (l = 1, 2, \dots, L) \tag{8}$$

where $p(z_l = 1 \mid x_i)$ and $n_l$ can be, respectively, calculated by Eqs. (5) and (6).

Bishop (2006) proved that by setting the derivative of the logarithmic likelihood in Eq. (3) with respect to $\Sigma_l$ to zero, and by making use of the result for the maximum likelihood solution for the covariance matrix of a single Gaussian, the following maximum likelihood solution for $\Sigma_l$ is obtained:

$$\Sigma_l = \frac{1}{n_l} \sum_{i=1}^{n} p(z_l = 1 \mid x_i) \, (x_i - \mu_l)(x_i - \mu_l)^T, \quad (l = 1, 2, \dots, L) \tag{9}$$

where $p(z_l = 1 \mid x_i)$ and $n_l$ can be, respectively, calculated by Eqs. (5) and (6); and $\mu_l$ can be calculated by Eq. (8).

The EM algorithm for determining the GMM parameters starts with randomly initialized parameter values of π, μ, Σ. First, estimate responsibilities of the L latent variables using the current parameter values of π, μ, Σ, and then seek the maximum likelihood solution for parameters corresponding to the latent variables. Keep alternating until the resulting values converge to fixed points. The pseudo-code for this iteration is outlined in Table 1.

Based on the output values of π, μ, Σ, the probability $p(x_i)$ of each data point $x_i$ (i = 1, 2, …, n) can be calculated by Eq. (2). This probability can be viewed as the degree to which the data point $x_i$ conforms to the GMM. Based on $p(x_i)$, the anomaly degree of data point $x_i$ is defined as

$$s(x_i) = \max_{1 \le k \le n} \{\ln p(x_k)\} - \ln p(x_i), \quad (i = 1, 2, \dots, n) \tag{10}$$

where $s(x_i)$ is the anomaly degree of data point $x_i$ (i = 1, 2, …, n). It has a non-negative value and is negatively correlated with the logarithmic probability $\ln p(x_i)$. It can be regarded as the degree to which the data point $x_i$ does not conform to the model. The larger the value of $s(x_i)$, the more likely it is that $x_i$ is an anomalous data point.

For geochemical anomaly detection, GMM is first used to model the standardized geochemical data to represent the background, and then Eq. (10) is used to compute $s(x_i)$ for each data point $x_i$. A threshold is eventually used to identify anomaly data points that clearly do not conform to the GMM model. If there are some known mineral deposits in the study area, the Youden index can be used to determine the “best” threshold (BT) (Chen, 2015; Chen and Wu, 2016, 2017a, b, c). How to determine BT is discussed in Section 5.2.

3. Study area and geochemical data

The study area is located in the Baishan district, Jilin Province, China. It covers four geological maps of 1:200,000 scale. The stream sediment survey data were collected from about 26,500 km² in the study area.


Table 1
Pseudo-code for the parameter estimation of GMM using the EM algorithm.
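
The body of Table 1 is not available in this text. As a stand-in, the following is a minimal NumPy sketch of the EM iteration of Eqs. (5)-(9) and of the anomaly degree of Eq. (10). It is an illustration under stated assumptions, not the authors' program: the function names, the log-domain evaluation of the densities, the small ridge `reg` added to each covariance, the initialization (random data points as means, the pooled sample covariance for every component) and the stopping tolerance are all choices made here for the sketch.

```python
import numpy as np


def log_gaussian(X, mean, cov):
    """ln g(x | mean, cov) for every row x of X."""
    m = X.shape[1]
    diff = X - mean
    # Squared Mahalanobis distances, using a linear solve instead of an inverse.
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (maha + m * np.log(2.0 * np.pi) + np.linalg.slogdet(cov)[1])


def log_mixture(X, pi, mu, sigma):
    """Return ln[pi_l g(x_i | mu_l, Sigma_l)] (n x L) and ln p(x_i) (length n)."""
    log_w = np.column_stack(
        [np.log(pi[l]) + log_gaussian(X, mu[l], sigma[l]) for l in range(len(pi))]
    )
    top = log_w.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
    log_p = (top + np.log(np.exp(log_w - top).sum(axis=1, keepdims=True))).ravel()
    return log_w, log_p


def fit_gmm_em(X, L, n_iter=200, tol=1e-6, reg=1e-6, seed=0):
    """Estimate (pi, mu, Sigma) of an L-component GMM by EM."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pi = np.full(L, 1.0 / L)
    mu = X[rng.choice(n, size=L, replace=False)].astype(float)  # assumed init
    pooled = np.atleast_2d(np.cov(X, rowvar=False)) + reg * np.eye(m)
    sigma = np.array([pooled.copy() for _ in range(L)])
    previous = -np.inf
    for _ in range(n_iter):
        log_w, log_p = log_mixture(X, pi, mu, sigma)
        resp = np.exp(log_w - log_p[:, None])      # E-step, Eq. (5)
        n_l = resp.sum(axis=0)                     # Eq. (6)
        pi = n_l / n                               # Eq. (7)
        mu = (resp.T @ X) / n_l[:, None]           # Eq. (8)
        for l in range(L):                         # Eq. (9), plus the ridge
            diff = X - mu[l]
            sigma[l] = (resp[:, l, None] * diff).T @ diff / n_l[l] + reg * np.eye(m)
        loglik = log_p.sum()                       # Eq. (3) at the previous parameters
        if abs(loglik - previous) < tol:
            break
        previous = loglik
    return pi, mu, sigma


def anomaly_degree(X, pi, mu, sigma):
    """Eq. (10): s(x_i) = max_k ln p(x_k) - ln p(x_i)."""
    _, log_p = log_mixture(X, pi, mu, sigma)
    return log_p.max() - log_p
```

Because EM only reaches a local maximum, different values of `seed` can give different fits for the same L, which matches the run-to-run variation the paper reports for repeated modeling.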

3.1. Geology and polymetallic mineralization

The study area is in the north margin of the north China platform and had experienced a complex geological evolution (Liu et al., 2000; Wu et al., 2005; Qin et al., 2014). Widely exposed geological formations include the Archean plutonic rocks, the Paleoproterozoic metamorphic rocks, the Neoproterozoic-Paleozoic sedimentary rocks, and the Mesozoic volcanic and volcanic-sedimentary rocks as well as the Paleoproterozoic gneiss granites and the Mesozoic granites (Fig. 1). The Ji'an-Songjiang tectonic belt traverses the whole study area and controls the spatial distribution of geological formations (Zhao et al., 1993; Zhang et al., 2006).

The Paleoproterozoic metamorphic formations provided fundamental substances for polymetallic mineralization (Wu et al., 1992; Zhang et al., 2011; Zhong et al., 2014). The Mesozoic granites and granite porphyries provided heat sources for the polymetallic mineralization and some mineralization substances (Zheng, 1995; Liu et al., 2000; Wu et al., 2005). Along the Ji'an-Songjiang tectonic belt and around the Mesozoic granites and granite porphyries, about 30 hydrothermal deposits have been discovered (Yang et al., 1999; Liu et al., 2009; Li et al., 2010).

3.2. Geochemical survey data

The stream sediment survey data come from China's National Geochemical Mapping Project (Xie et al., 1997). The sampling density was 1 sample per 4 km², and six major and 29 minor and trace elements were, respectively, analyzed by AAS and ICP-AES. There were 6608 stream sediment samples used for this study. The stream sediment survey data of each element were transformed into a 100 × 150 grid map using the interpolating method of Inverse Distance to a Power in the Golden Software Surfer of version 12. Each grid point in the map represents a unit cell of 1.2167 × 1.6365 km². This cell size ensures that at most one mineral deposit is present in the cell represented by any given grid point. The stream sediment survey data of the 35 elements were transformed into the 35 grid maps.

The AUCs and ZAUCs were computed based on each of the 35 grid maps and the known mineral deposit locations in the study area. The 30 known mineral deposit locations in the study area were used as “the ground truth data” for defining the true positive and true negative points. If a unit cell represented by a grid point contains a mineral deposit, the grid point is called a true positive point; otherwise it is called a true negative point. Suppose that there are p true positive points and q true negative points in the study area. According to the Wilcoxon test of ranks (Bergmann et al., 2000; Chen, 2015; Chen and Wu, 2016), the AUC of an element can be calculated by

$$AUC = \frac{1}{pq} \sum_{i=1}^{p} \sum_{j=1}^{q} \varphi(x_i, y_j) \tag{11}$$

with

$$\varphi(x_i, y_j) = \begin{cases} 1, & x_i > y_j \\ 0.5, & x_i = y_j \\ 0, & x_i < y_j \end{cases}$$

where $x_i$ (i = 1, 2, …, p) represents the concentration value of the element at the ith true positive point and $y_j$ (j = 1, 2, …, q) represents the concentration value of the element at the jth true negative point. According to Chen (2015), ZAUC can be written as

$$Z_{AUC} = \frac{AUC - 0.5}{SE_{AUC}} \tag{12}$$

with

$$SE_{AUC} = \sqrt{\frac{AUC\,(1 - AUC) + (p - 1)\left(\dfrac{AUC}{2 - AUC} - AUC^2\right) + (q - 1)\left(\dfrac{2\,AUC^2}{1 + AUC} - AUC^2\right)}{pq}}$$

where $SE_{AUC}$ is the standard deviation of AUC.


Fig. 1. Simplified geologic map and discovered mineral deposits (Chen and Wu, 2016).
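
Eqs. (11) and (12) can be evaluated directly from the values at true positive and true negative points. The sketch below is a plain-Python illustration with hypothetical function names; it uses the straightforward double loop of Eq. (11) rather than a rank-based shortcut. The check in the usage note assumes p = 30 deposits and q = 100 × 150 − 30 = 14,970 remaining grid points, which is an assumption about how the grid was counted.

```python
import math


def auc_wilcoxon(pos, neg):
    """Eq. (11): average of phi(x_i, y_j) over all positive/negative pairs."""
    total = 0.0
    for x in pos:
        for y in neg:
            if x > y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(pos) * len(neg))


def z_auc(auc, p, q):
    """Eq. (12): return (Z_AUC, SE_AUC) for p positive and q negative points."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (p - 1) * (q1 - auc ** 2)
        + (q - 1) * (q2 - auc ** 2)
    ) / (p * q)
    se = math.sqrt(variance)
    return (auc - 0.5) / se, se
```

Under the assumed counts, `z_auc(0.851, 30, 14970)` gives a standard error of about 0.044, in line with the SAUC reported for GMM in Table 6; the Z value differs slightly in the last digits because the AUC is entered here in rounded form.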

The AUC reflects the spatial relationship between element concentrations and mineral deposit locations. A value of AUC is in the range of 0.5–1, which corresponds, respectively, to random and deterministic relationships between element concentrations and mineral deposit locations. The ZAUC is a normally distributed statistic for testing whether an AUC is significantly different from 0.5. If the estimated value of ZAUC is greater than the critical value of 1.96 at the significance level of 0.05, the spatial relationship between element concentrations and mineral deposit locations is considered to be significant. Table 2 lists the estimated AUCs and ZAUCs of 35 elements. It shows that Au, Ag, As, B, Bi, CaO, Cd, Cu, Hg, MgO, P, Pb, Sb, Sn, V, W, and Zn are effective candidate metallogenic indicators of hydrothermal deposits in the study area because their ZAUCs are greater than the critical value of 1.96.

Through the analysis of the above statistical results and the analysis of the geological and mineralization characteristics of the study area, the metallogenic indicators were selected from the 35 elements. The Paleoproterozoic formations have genetic relationships with the polymetallic mineralization (Wu et al., 1992; Yang et al., 1999, 2001; Liu et al., 2009; Li et al., 2010). The high-metamorphic rocks are rich in boron and the low-metamorphic rocks are rich in CaO and MgO. Accordingly, boron, CaO, and MgO were chosen as metallogenic indicators. According to the characteristics of the known hydrothermal mineral deposits in the study area, Au, Ag, Cu, Pb, Zn, and Co are the primary metallogenic elements; and As, Sb, Bi, and Hg are the associated metallogenic elements. Hence, gold, Ag, Cu, Pb, Zn, Co, As, Sb, Bi, and Hg were chosen as metallogenic indicators. Finally, gold, Ag, As, B, Bi, CaO, Co, Cu, Hg, MgO, Pb, Sb, and Zn were chosen as the metallogenic indicators.

4. Geochemical data modeling

GMM and OCSVM were used to model the standardized stream sediment survey data. Python code from scikit-learn was used for GMM and OCSVM modeling, and Python code developed by Yongliang Chen was used for data input and output as well as performance evaluation.

Table 2
AUCs and ZAUCs for 35 elements (Chen and Wu, 2016).

Element  AUC    ZAUC     Element  AUC    ZAUC     Element  AUC    ZAUC
Ag       0.741   4.616   Cu       0.801   6.190   Pb       0.718   4.095
Al2O3    0.567   1.231   Fe2O3    0.558   1.073   Sb       0.720   4.141
As       0.717   4.074   Hg       0.682   3.365   SiO2     0.385  −2.443
Au       0.773   5.412   La       0.484  −0.306   Sn       0.667   3.071
B        0.626   2.301   Li       0.590   1.651   Sr       0.555   1.019
Ba       0.491  −0.179   MgO      0.817   6.692   Ti       0.427  −1.477
Be       0.312  −4.560   Mn       0.501   0.011   V        0.683   3.373
Bi       0.742   4.629   Mo       0.498  −0.037   W        0.675   3.222
CaO      0.785   5.735   Na2O     0.407  −1.935   Y        0.386  −2.426
Cd       0.798   6.109   Nb       0.298  −5.035   Zn       0.670   3.120
Co       0.527   0.497   Ni       0.605   1.911   Zr       0.239  −7.630
Cr       0.556   1.029   P        0.697   3.663


4.1. GMM modeling

The number of mixing components (L) needs to be defined for GMM. To test how the performance of GMM varies with the value of L, let the value of L start at 1 and increase by 1 at a time, all the way to 10. For each value of L, the geochemical data were modelled three times using the GMM, and the anomaly degree of each data point was calculated using Eq. (10). For each modeling result, the AUC value was calculated using Eq. (11), and the percentage of anomaly areas (PAA) was estimated based on the geochemical anomalies separated by the method described in Section 5.2. Table 3 lists the AUCs and PAAs estimated using the modeling results; and Fig. 2 shows the curves of AUC and PAA varying with L.

Fig. 2a shows that AUC increases as L increases and fluctuates when L ≥ 5. Fig. 2b shows that PAA is less than 16.2% when L ≤ 5, while it jumps to more than 23% when L ≥ 6 (see Table 3). According to this result, L = 4 is considered to be optimal for the GMM modeling. This result is also consistent with the geological characteristics because only five lithologic formations are widely exposed in the study area. Therefore, the GMM modeling result at L = 4 was used to separate geochemical anomalies.

Table 3
AUCs and PAAs for all L values in the GMM modeling.

Modeling times   First              Second             Third
L                AUC    PAA (%)     AUC    PAA (%)     AUC    PAA (%)
1                0.818   8.36       0.818   8.36       0.818   8.36
2                0.845  16.19       0.846  14.93       0.846  14.92
3                0.850  14.49       0.850  14.49       0.850  14.49
4                0.851  14.46       0.851  14.46       0.851  14.20
5                0.843  15.05       0.843  15.05       0.843  15.05
6                0.856  42.10       0.834  39.55       0.841  36.75
7                0.857  36.67       0.858  33.43       0.861  47.15
8                0.851  23.26       0.841  36.41       0.851  35.39
9                0.852  37.84       0.851  45.22       0.847  40.05
10               0.855  37.63       0.855  39.81       0.846  45.39

4.2. OCSVM modeling

An OCSVM can model geochemical data without any assumptions on the data distribution. The model seeks a subset in the input space to support the high-dimensional distribution of input data; and anomaly data points are those which are drawn from the high-dimensional distribution but located outside of the support subset. By referring to Chen and Wu (2017c), the anomaly degree of data point x can be written as

$$f(x) = \sum_{i=1}^{n} \alpha_i \left[ K(x_i, x_j) - K(x_i, x) \right], \quad j \in [1, 2, \dots, n] \tag{13}$$

where $f(x)$ is the anomaly degree of data point x; $\alpha_i$ (i = 1, 2, …, n) is the Lagrange parameter; $K(\cdot, \cdot)$ is the Gaussian kernel; and n is the number of data points.

The parameters σ and v need to be determined for OCSVM by trial and error. In this study, each pair of σ and v was selected, respectively, from the two sequences of {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} and {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, and used to initialize OCSVM. The initialized OCSVM was trained on all the data points, and then the anomaly degree of each data point was calculated by Eq. (13).

Table 4
AUCs for all pairs of σs and vs in the OCSVM modeling (rows: σ; columns: v).

σ \ v   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.1     0.745  0.779  0.799  0.808  0.812  0.815  0.819  0.821  0.822
0.2     0.763  0.779  0.801  0.812  0.821  0.826  0.829  0.830  0.830
0.3     0.767  0.786  0.807  0.821  0.830  0.835  0.836  0.836  0.835
0.4     0.777  0.791  0.813  0.828  0.837  0.843  0.843  0.842  0.840
0.5     0.782  0.799  0.819  0.834  0.843  0.848  0.850  0.846  0.844
0.6     0.789  0.808  0.824  0.838  0.847  0.852  0.855  0.851  0.848
0.7     0.780  0.817  0.827  0.842  0.851  0.855  0.858  0.856  0.851
0.8     0.809  0.826  0.832  0.847  0.854  0.858  0.860  0.860  0.855
0.9     0.817  0.831  0.838  0.850  0.856  0.860  0.862  0.862  0.859
1.0     0.817  0.832  0.846  0.852  0.857  0.861  0.863  0.864  0.862

Table 5
PAAs (%) for all pairs of σs and vs in the OCSVM modeling (rows: σ; columns: v).

σ \ v   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.1     18.25  13.73  12.37  16.27  16.25  18.14  14.82  14.36  14.13
0.2      8.96  14.89  13.73  11.81  12.57  14.02  11.37  11.39  11.45
0.3     13.90  10.23  17.67  13.55  12.04  12.61  13.73  10.86  10.77
0.4     11.6   10.46  22.59  16.17  13.68  12.51  12.65  13.68  14.59
0.5     16.66  10.61  10.94  19.21  15.41  13.93  13.21  12.76  13.61
0.6     22.14  10.91   7.89  27.94  18.46  18.69  14.49  13.92  13.75
0.7     20.74  37.49   7.89  26.90  19.45  18.18  18.83  15.44  14.92
0.8     41.10  37.94  37.23  37.52  39.12  18.54  18.32  25.35  25.44
0.9     45.37  42.72  43.98  37.27  24.61  41.43  25.09  24.79  24.77
1.0     42.00  42.23  40.59  35.25  36.87  39.58  20.61  20.63  24.04

Fig. 2. Curves of AUC and PAA varying the number of mixing components.
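
The paper states that scikit-learn was used for both models but does not show the calls. The following is one plausible sketch, not the authors' script. `GaussianMixture`, `OneClassSVM` and `StandardScaler` are real scikit-learn classes; the wrapper function names are hypothetical, and two points are assumptions: the paper's σ is mapped to scikit-learn's `gamma` as 1/(2σ²), and the OCSVM anomaly degree is taken as the negated `decision_function`, which corresponds to Eq. (13) only up to the library's internal scaling of the Lagrange parameters.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def gmm_anomaly_degree(X, n_components=4, seed=0):
    """Standardize (Eq. 1), fit a GMM, and return the anomaly degree of Eq. (10)."""
    Xs = StandardScaler().fit_transform(X)
    gmm = GaussianMixture(
        n_components=n_components, covariance_type="full", random_state=seed
    ).fit(Xs)
    log_p = gmm.score_samples(Xs)  # ln p(x_i) for every data point
    return log_p.max() - log_p


def ocsvm_anomaly_degree(X, sigma=0.6, nu=0.7):
    """Standardize, fit a one-class SVM, and return larger-is-more-anomalous scores."""
    Xs = StandardScaler().fit_transform(X)
    gamma = 1.0 / (2.0 * sigma ** 2)  # assumed relation between sigma and gamma
    svm = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(Xs)
    return -svm.decision_function(Xs)  # negated so outliers score high
```

Repeating `gmm_anomaly_degree` with different `seed` values for each L from 1 to 10 would mimic the three modeling runs per L summarized in Table 3.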


Fig. 3. Curves of AUC and PAA varying with σ and v.
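
The PAA values reported for each model depend on the “best” threshold, which the paper selects as the threshold with the largest Youden index (Benefit minus Cost). A minimal plain-Python sketch of that search follows. The function name is hypothetical, and two conventions are assumptions made here: a point is flagged as anomalous when its score is greater than or equal to the threshold, and PAA is taken as the fraction of flagged grid points, which presumes equal-area cells.

```python
def youden_best_threshold(scores, is_deposit):
    """Return (BT, LYI, benefit, cost, PAA) over all candidate thresholds.

    scores     -- anomaly degree of every grid point
    is_deposit -- True/1 for true positive points, False/0 for true negatives
    """
    labels = [bool(d) for d in is_deposit]
    p = sum(labels)
    q = len(labels) - p
    best = None
    for t in sorted(set(scores)):
        flagged = [s >= t for s in scores]
        benefit = sum(f and d for f, d in zip(flagged, labels)) / p
        cost = sum(f and not d for f, d in zip(flagged, labels)) / q
        youden = benefit - cost
        if best is None or youden > best[1]:
            paa = sum(flagged) / len(scores)
            best = (t, youden, benefit, cost, paa)
    return best
```

Plotting the (cost, benefit) pairs visited by the loop gives the ROC curve; the returned tuple corresponds to the BT, LYI, Benefit and PAA rows of a summary such as Table 6.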

Table 6 considered to be optimal for the OCSVM modeling. Therefore, the


Statistics for the GMM and OCSVM modeling results. OCSVM modeling result at v = 0.7 and σ = 0.6 was used to separate
Statistics GMM (L = 4) OCSVM (v = 0.7; σ = 0.6) geochemical anomalies.

AUC 0.851 0.855


5. Results and discussion
SAUC 0.0441 0.0436
ZAUC 7.953 8.132
LYI 0.557 0.556 The performances of GMM and OCSVM in geochemical anomaly
BT 30.611 264.932 detection were evaluated using ROC and AUC, and the data modeling
PAA (%) 14.46 14.49 efficiencies of GMM and OCSVM were evaluated using PRT. The BT for
Benefit (%) 83 70
separating geochemical anomalies was determined by using the Youden
PRT (s) 18.67 32.14
index (Chen, 2015; Chen and Wu, 2016, 2017a, b, c).

The AUC and PAA for each modeling result were estimated using the 5.1. Performance evaluation
methods describe in Section 4.1. Tables 4 and 5 list the AUCs and PAAs,
respectively, for all pairs of σs and vs in the OCSVM modeling; and If the GMM and OCSVM algorithms perform well in geochemical
Fig. 3 shows the curves of AUC and PAA varying with σ and v. anomaly detection, their modeling results should be highly spatially
Fig. 3a shows that AUC increases rapidly with increase of v and tend associated with the known mineral deposit locations. These spatial re-
to be stable when v ≥ 0.7. Fig. 3b shows that PAA fluctuates with in- lationships were evaluated using ROC and AUC (Chen, 2015; Chen and
crease of v and tends to be stable when v ≥ 0.7. Fig. 3c shows that AUC Wu, 2016, 2017a, b, c).
increases with increase of σ and tends to be stable when σ ≥ 0.6. Fig. 3d By replacing the element concentration value with the anomaly
shows that PAA remains stable with increase of σ and increases rapidly degree, the AUCs for GMM and OCSVM were estimated through Eq.
when σ ≥ 0.6. Based on these results, v = 0.7 and σ = 0.6 were (11); and the SAUCs and ZAUCs were then computed using Eq. (12) based

14
Y. Chen, W. Wu Computers and Geosciences 125 (2019) 9–18

Fig. 4. Geochemical anomalies separated optimally from the modeling results of (a) GMM and (b) OCSVM.

on the AUCs. Table 6 lists the AUCs, SAUCs, ZAUCs and PRTs of GMM and because their AUCs (0.851 and 0.855) are approximately equal; and (c)
OCSVM. These statistics reveal: (a) the GMM and OCSVM modeling GMM is more efficient than OCSVM in data modeling because the PRT
results are significantly spatially associated with the known mineral of GMM is 18.67 s while that of OCSVM is 32.14 s.
deposit locations because the ZAUCs for the two methods (7.940 and A threshold can classify a data point into anomaly (predicted posi-
8.132) are much higher than the critical value of 1.96; (b) GMM and tive points) and background (predicted negative points) based on the
OCSVM perform similarly well in geochemical anomaly detection anomaly degree of the data point. The predicted positive and negative

15
Y. Chen, W. Wu Computers and Geosciences 125 (2019) 9–18

normalized value in the data matrix X is expressed as


x ij x minj
x ij = , (j = 1,2, …, m)
x maxj x minj (15)
where x ij and x ij are, respectively, the observed and normalized values
of element j on data point i; and x minj and x maxj are, respectively, the
minimum and maximum values of element j. The normalized data range
between 0 and 1.
In order to test whether the data re-scaling methods affect the
performance of GMM and OCSVM in geochemical anomaly detection,
the geochemical survey data were normalized using Eq. (15) and then
modelled using GMM and OCSVM.
The method used for modeling the standardized data in Section 4.1
was used to model the normalized geochemical data. The GMM mod-
eling results of the normalized data were compared with those of the
standardized data. Fig. 6a shows no significant difference in AUCs be-
tween the standardized and normalized data modeling results; and
Fig. 6b shows no significant difference in PAAs between the standar-
dized and normalized data modeling results. Therefore, the data re-
scaling methods have little influence on the performance of GMM in
geochemical anomaly detection.
The method used for modeling the standardized data in Section 4.2
was used to model the normalized geochemical data. The OCSVM
Fig. 5. ROC curves for GMM and OCSVM.
modeling results of the normalized data were compared with those of
the standardized data. Fig. 6c shows significant differences between
AUCs for the standardized and normalized data modeling results; and
points, as well as the true positive and true negative points defined in Fig. 6d shows significant differences between PAAs for the standardized
Section 3.2, were used to calculate the Benefit and Cost at the threshold. and normalized data modeling results. Therefore, the data re-scaling
Benefit is defined as the percentage of true positive points that are methods significantly affect the performance of OCSVM in geochemical
correctly classified as positive points; and Cost is defined as the per- anomaly detection.
centage of true negative points that are incorrectly classified as positive GMM only needs to define a positive-integer-valued parameter L
points (Chen and Wu, 2016). By changing threshold, the Cost and while OCSVM needs to define two positive-real-valued parameters σ
Benefit at each possible threshold are computed and the ROC curve is and v by trial and error. Therefore, GMM is easier to implement than
generated by plotting Benefit against Cost at all the thresholds. The OCSVM in geochemical anomaly detection. The disadvantage of GMM
closer the ROC curve is to the upper left corner of the ROC space, the is that repeated modeling results of the same geochemical data using
better the data modeling method performs in anomaly detection. Fig. 5 the same L value are often different, because the EM algorithm cannot
shows that the ROC curves of GMM and OCSVM intersect each other in guarantee finding the global maximum (Wu, 1983). How to solve this
the ROC space, so GMM and OCSVM have similar performance in problem needs further investigation.
geochemical anomaly detection. As long as the parameter L is large enough, the GMM model can
approximate almost any continuous population distribution. On the
5.2. Anomaly separation other hand, when the parameter L = 1, the GMM model degenerates
into a multivariate Gaussian function. Therefore, the GMM model can
The BT is selected from all possible thresholds for each data mod- be used as a universal anomaly detector to separate geochemical
eling result. Each possible threshold is used to separate geochemical anomalies from the sample data of any continuous population dis-
anomalies, and the Benefit and Cost at the threshold are then estimated and used to compute the Youden index. The Youden index is defined as Benefit minus Cost (Chen and Wu, 2016). It can be viewed as an expression of the spatial relationships between the separated geochemical anomalies and the known mineral deposit locations. Of all the possible thresholds, the BT is the threshold with the largest Youden index (LYI). In other words, compared with the geochemical anomalies separated by any other threshold, those separated by the BT have the strongest spatial relationship with the known mineral deposit locations.

The LYIs, BTs, PAAs, and Benefits of GMM and OCSVM are listed in Table 6. These statistics show that the geochemical anomalies separated by the GMM and OCSVM algorithms occupy, respectively, 14.46% and 14.49% of the study area, and contain, respectively, 83% and 70% of the known mineral deposits. Fig. 4 shows the geochemical anomalies separated optimally from the GMM and OCSVM modeling results. These anomalies are highly spatially associated with the known mineral deposit locations.

tribution. In practice, we can use the GMM with different L values (L = 1, 2, 3, …) to detect geochemical anomalies and select the model with the best performance. In this study, the GMM with the parameter L = 4 performs best in geochemical anomaly detection. Thus, the study area has a complex geochemical background.

The geochemical anomalies delineated in Section 5.2 are mainly distributed in the southwest of the study area, with some in the north. These geochemical anomalies spatially coincide with the Archean and Paleoproterozoic metamorphic rocks which provided the fundamental substances for regional mineralization of the study area (Wu et al., 1992; Zhang et al., 2011; Zhong et al., 2014). The distribution direction of the separated geochemical anomalies is consistent with the regional structure of the study area. More than 80 per cent of known mineral deposits are located in the separated geochemical anomalies. Therefore, the geochemical anomaly detection results are consistent with the geological and metallogenic features of the study area.
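The best-threshold rule described for Section 5.2 (the BT is the threshold with the largest Youden index, where the Youden index is Benefit minus Cost) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that Benefit is the fraction of known deposit cells falling inside the anomalies and that Cost is the fraction of non-deposit cells falling inside the anomalies; the text here gives only their difference, so both definitions, the function name `best_threshold`, and the toy data are hypothetical.

```python
def best_threshold(scores, is_deposit):
    """Return (threshold, youden, benefit, cost) with the largest Youden index.

    scores:     anomaly score per grid cell (assumed: higher = more anomalous).
    is_deposit: True where the cell contains a known mineral deposit.
    """
    n_pos = sum(1 for d in is_deposit if d)
    n_neg = len(is_deposit) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one deposit cell and one non-deposit cell")

    best = None
    # Each distinct score is a candidate threshold; cells scoring >= t are anomalous.
    for t in sorted(set(scores)):
        hit_pos = sum(1 for s, d in zip(scores, is_deposit) if d and s >= t)
        hit_neg = sum(1 for s, d in zip(scores, is_deposit) if not d and s >= t)
        benefit = hit_pos / n_pos  # assumed: share of known deposits inside anomalies
        cost = hit_neg / n_neg     # assumed: share of non-deposit cells inside anomalies
        youden = benefit - cost    # Youden index = Benefit minus Cost
        if best is None or youden > best[1]:
            best = (t, youden, benefit, cost)
    return best


# Toy data invented for illustration; not taken from the study area.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
is_deposit = [False, False, False, False, True, True, False, True, True, True]

bt, lyi, benefit, cost = best_threshold(scores, is_deposit)
print(f"BT={bt}, LYI={lyi:.2f}, Benefit={benefit:.2f}, Cost={cost:.2f}")
```

When several thresholds tie, this sketch keeps the lowest one; a different tie-breaking rule would be equally defensible.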

5.3. Discussion

Besides standardization, normalization is another commonly used data re-scaling method in geochemical anomaly detection. A

6. Conclusion

Gaussian mixture model was used to fit high dimensional geochemical data, and multivariate geochemical anomalies were separated


Fig. 6. Curves of AUC and PAA of the standardized and normalized data modeling results.

from the complex geochemical background successfully. This study shows that the performance of the Gaussian mixture model is similar to that of the one-class support vector machine, but the data modeling efficiency of the Gaussian mixture model is higher than that of the one-class support vector machine. Therefore, the Gaussian mixture model is a potentially useful anomaly detection method with high performance and data modeling efficiency. The geochemical data modeling results of the Gaussian mixture model and the one-class support vector machine are strongly consistent with the metallogenic characteristics of the study area. The separated geochemical anomalies have close spatial relationships with the known mineral deposit locations in the study area.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant nos. 41472299 and 41672322).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.cageo.2019.01.010.

References

Bergmann, R., Ludbrook, J., Spooren, W.P.J.M., 2000. Different outcomes of the Wilcoxon-Mann-Whitney test from different statistics packages. Am. Statistician 54 (1), 72–77.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer, 738 pp.
Chen, Y.L., 2015. Mineral potential mapping with a restricted Boltzmann machine. Ore Geol. Rev. 71, 749–760.
Chen, Y.L., Wu, W., 2016. A prospecting cost-benefit strategy for mineral potential mapping based on ROC curve analysis. Ore Geol. Rev. 74, 26–38.
Chen, Y.L., Wu, W., 2017a. Mapping mineral prospectivity using an extreme learning machine regression. Ore Geol. Rev. 80, 200–213.
Chen, Y.L., Wu, W., 2017b. Mapping mineral prospectivity by using one-class support vector machine to identify multivariate geological anomalies from digital geological survey data. Aust. J. Earth Sci. 44 (5), 639–651.
Chen, Y.L., Wu, W., 2017c. Application of one-class support vector machine to quickly identify multivariate anomalies from geochemical exploration data. Geochem. Explor. Environ. Anal. 17, 231–238.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1997. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B 39 (1), 1–38.
Drews Jr., P., Núñez, P., Rocha, P.R., Campos, M., Dias, J., 2013. Novelty detection and segmentation based on Gaussian mixture models: a case study in 3D robotic laser mapping. Robot. Autonom. Syst. 61, 1696–1709.
Huang, Z.K., Chau, K.W., 2008. A new image thresholding method based on Gaussian mixture model. Appl. Math. Comput. 205, 899–907.
Khanmohammadi, S., Chou, Chun-An, 2016. A Gaussian mixture model based discretization algorithm for associative classification of medical data. Expert Syst. Appl. 58, 119–129.
Kim, S.C., Kang, T.J., 2007. Texture classification and segmentation using wavelet packet frame and Gaussian mixture model. Pattern Recogn. 40, 1207–1221.
Lindstrom, M.J., Bates, D.M., 1988. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. J. Am. Stat. Assoc. 83 (404), 1014–1022.
Li, B., Yang, Z., Wang, Y., 2010. Geological characteristics and genesis of Huanggoushan and Banmiaozi gold deposits in Laoling metallogenic belt of southern Jilin. Glob. Geol. 29 (3), 392–399 (in Chinese).
Li, L.H., Hansman, J.R., Palacios, R., Welsch, R., 2016. Anomaly detection via a Gaussian mixture model for flight operation and safety monitoring. Transport. Res. Part C 64, 45–57.
Liu, W., Deng, J., Chu, X.L., Zhai, Y.S., Xu, G.Z., Li, X.J., 2000. Characteristics and geological background of formation of large and giant ore deposits within the northern margin of the north China platform. Prog. Geophys. 15 (2), 67–78 (in Chinese).
Liu, W., Man, Y., Wang, X., 2009. Geology and genesis of the Jinying gold deposit in Jilin Province. Geol. Resour. 18 (4), 279–283 (in Chinese).
Qin, Y., Chen, D.D., Liang, Y.H., Zou, C.M., Zhang, Q.W., Bai, L.A., 2014. Geochronology of Ji’an Group in Tonghua area, southern Jilin Province. Earth Sci. J. China Univ. Geosci. 39 (11), 1587–1599 (in Chinese).
Simms, M.L., Blair, B., Ruz, J., Wurtz, R., Kaplan, D.A., Glenn, A., 2018. Pulse discrimination with a Gaussian mixture model on an FPGA. Nucl. Instrum. Methods Phys. Res. A 900, 1–7.
van Dyk, D.A., 2000. Fitting mixed-effects models using efficient EM-type algorithms. J. Comput. Graph Stat. 9 (1), 78–98.
Wu, C.F.J., 1983. On the convergence properties of the EM algorithm. Ann. Stat. 11 (1), 95–103.
Wu, D.Y., Yang, Y., Song, Q., 1992. Strata bound characteristics of gold, lead and zinc deposits in the Ji’an Group, southern part of Jilin Province. Jilin Geol. 11 (4), 8–16 (in Chinese).
Wu, F., Lin, J., Wilde, S.A., Zhang, Q., Yang, J., 2005. Nature and significance of early Cretaceous giant igneous event in eastern China. Earth Planet. Sci. Lett. 233, 103–119.
Xie, X., Mu, X., Ren, T., 1997. Geochemical mapping in China. J. Geochem. Explor. 60, 99–113.
Yang, Y.C., Feng, B.Z., Liu, P.E., 2001. Dahenglu type of cobalt deposit in Laoling area, Jilin Province: a sedex deposit with late reformation. Journal of Changchun University of Science and Technology 31 (1), 40–45 (in Chinese).
Yang, Y.C., Ye, S.Q., Feng, B.Z., 1999. The Huanggoushan typed hot-water deposition and superimposed reformation gold deposit in Laoling mineralization belt of South Jilin Province. Gold 6, 1–4 (in Chinese).
Zhang, G.R., Jiang, S., Han, X.P., Huang, Z.F., Qu, H.X., Guo, W.J., Wang, F.J., 2006. The main characteristics of Yalujiang fault zone and its significance. Geol. Resour. 15 (1), 11–19 (in Chinese).
Zhang, L.M., Wang, D.S., Zhang, D.W., 2011. Geologic characteristics, ore-controlling factors and prospects of the Gaoligou gold deposit in Jilin Province. Geol. Resour. 20, 350–353 (in Chinese).
Zhao, G.M., Gao, C.B., Chou, J.B., Li, Z.Y., 1993. Base structure and the Yalu River fault zone in Dandong district. Acta Seismol. Sin. (Chin. Ed.) 15 (3), 282–288 (in Chinese).
Zheng, C.J., 1995. The geological features and origin of the Huanggoushan gold deposit, Jilin Province. Jilin Geol. 14 (3), 1–16 (in Chinese).
Zhong, G.J., Run, T.Y., Cai, Y., 2014. Geological features and origin of Cuocaogou gold deposit. Western Prospecting Engineering 3, 117–124 (in Chinese).
