Computers and Geosciences 125 (2019) 9-18

The document discusses using a Gaussian mixture model to separate geochemical anomalies from sample data of unknown distribution. It compares using a GMM to a one-class support vector machine on geochemical survey data from China. The GMM took less time to run and performed comparably to the SVM in detecting anomalies and known mineral deposits.

Computers and Geosciences 125 (2019) 9–18

Contents lists available at ScienceDirect

Computers and Geosciences

journal homepage: www.elsevier.com/locate/cageo

Separation of geochemical anomalies from the sample data of unknown distribution population using Gaussian mixture model

Yongliang Chen (a,∗,1), Wei Wu (b,2)

a Institute of Mineral Resources Prognosis on Synthetic Information, Jilin University, Changchun, Jilin Province, 130026, PR China
b Changchun Institute of Urban Planning and Design, Changchun, Jilin Province, 130033, PR China

ARTICLE INFO

Keywords: Geochemical anomaly; Gaussian mixture model; One-class support vector machine; Receiver operating characteristic; Area under the curve; Youden index

ABSTRACT

The separation of geochemical anomalies from the sample data of unknown distribution population is a great challenge, as it is difficult to determine the correct model for the unknown population distribution. Gaussian mixture model is a linear combination of several Gaussians. By using enough number of Gaussians and by adjusting parameters, the model can generate very complex probability density, which can approximate almost any continuous probability. Therefore, the Gaussian mixture model can fit the sample data of unknown distribution population, and those data points that do not conform to the model are considered as anomalies. The method was used to separate multivariate anomalies from the geochemical survey data of 1:200,000 scale collected from the Baishan district, Jilin Province, China, and compared with one-class support vector machine. The programs running the two models took 18.67 and 32.14 s, respectively; the receiver operating characteristic curves of the two models intersect each other in the ROC space; and area under the curves of the two models are 0.851 and 0.855 respectively. The “best” threshold determined by using the Youden index was used to separate geochemical anomalies. The anomalies separated from the modeling results of the two models occupy respectively 14.46% and 14.49% of the study area and contain respectively 83% and 70% of the known mineral deposits. Therefore, Gaussian mixture model is comparable to one-class support vector machine in geochemical anomaly detection. It can be used as a geochemical anomaly detector with high performance and data modeling efficiency.

1. Introduction

Gaussian mixture model (GMM) is a linear combination of several Gaussians. By using a sufficient number of Gaussians and by adjusting their means and covariances as well as the coefficients in the linear combination, GMM can fit almost any continuous probability density (Bishop, 2006). Given the sample data drawn from an unknown complex distribution, the maximum likelihood parameters of the GMM can be determined by the expectation maximization (EM) algorithm (Dempster et al., 1997; Lindstrom and Bates, 1988; van Dyk, 2000). GMMs have been applied in data classification, image segmentation, target discrimination, and novelty detection (Drews-Jr., 2013; Huang and Chau, 2008; Khanmohammadi and Chou, 2016; Kim and Kang, 2007; Li et al., 2016; Simms et al., 2018).

In geochemical exploration, one may assume that geochemical data points are drawn independently and randomly from an unknown population distribution. A GMM may then be used to model the geochemical data for expressing the background, and those data points that do not conform to the model are separated as anomalies. As a demonstration, GMM was used to separate geochemical anomalies from the stream sediment survey data of 1:200,000 scale collected from the Baishan district, Jilin Province, China and compared with one-class support vector machine (OCSVM) (Chen and Wu, 2017b, c). The program run time (PRT), receiver operating characteristic (ROC) and area under the curve (AUC) (Chen, 2015; Chen and Wu, 2016, 2017a, b, c) were used to evaluate the data-modeling efficiencies and performances of GMM and OCSVM in geochemical anomaly detection. The main contribution of this paper is that a GMM-based anomaly detector with high performance and data-modeling efficiency is developed for separating geochemical anomalies from the sample data of unknown distribution population.

∗ Corresponding author. E-mail address: [email protected] (Y. Chen).
1 Geochemical anomaly detection using GMM and OCSVM and manuscript writing.
2 Geochemical data preprocessing and thematic map generating using Surfer and Grapher.

https://doi.org/10.1016/j.cageo.2019.01.010
Received 2 August 2018; Received in revised form 4 November 2018; Accepted 14 January 2019
Available online 18 January 2019
0098-3004/ © 2019 Elsevier Ltd. All rights reserved.
Y. Chen, W. Wu Computers and Geosciences 125 (2019) 9–18

2. GMM-based geochemical anomaly detector

An n × m matrix X is used to express the geochemical data set, {x1, x2, …, xn}, with the ith data point $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T$. The entry $x_{ij}$ (i = 1, 2, …, n; j = 1, 2, …, m) represents the observed value of element j on data point i. The geochemical data need to be standardized so that each element has zero mean and unit standard deviation. The standardized value of element j is expressed as

$$x'_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}, \quad (j = 1, 2, \ldots, m) \tag{1}$$

where $x_{ij}$ and $x'_{ij}$ are, respectively, the observed and standardized values of element j; and $\bar{x}_j$ and $\sigma_j$ are, respectively, the mean and standard deviation of element j.

Assuming that the n data points are drawn independently and randomly from an unknown distribution, the data distribution can be represented by the following GMM:

$$p(x) = \sum_{l=1}^{L} \pi_l \, g(x \mid \mu_l, \Sigma_l) \tag{2}$$

where $p(x)$ is the marginal probability; L is the number of Gaussians; $\pi_l$ is the lth mixing coefficient that satisfies $0 \le \pi_l \le 1$ and $\sum_{l=1}^{L} \pi_l = 1$; and $g(x \mid \mu_l, \Sigma_l)$ is the Gaussian density called the lth component and has its own mean $\mu_l$ and covariance $\Sigma_l$. The mixing coefficient $\pi_l$ can be viewed as the prior probability of picking the lth component, and the density $g(x \mid \mu_l, \Sigma_l)$ can be viewed as the probability of data point x conditioned on the lth component (Bishop, 2006).

The notations π = {π1, …, πL}, μ = {μ1, …, μL} and Σ = {Σ1, …, ΣL} are used to represent all the parameters of the GMM of Eq. (2). These parameters can be determined by maximizing the following logarithmic likelihood:

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n} \ln \left\{ \sum_{l=1}^{L} \pi_l \, g(x_i \mid \mu_l, \Sigma_l) \right\} \tag{3}$$

where X is the known geochemical data set.

Eq. (3) involves the unknown parameters π, μ, Σ as well as the unobserved latent variables that determine the component from which the data points originate. Finding a maximum likelihood solution of Eq. (3) leads to a set of unsolvable equations in which the solution to the latent variables needs the known parameter values and vice versa. Fortunately, this problem can be solved by an EM algorithm (Lindstrom and Bates, 1988; van Dyk, 2000) but the algorithm cannot guarantee finding the global maximum (Wu, 1983).

Let an n × L matrix $Z = (z_{il})_{n \times L}$ express the unobserved data of the L latent variables. The entry $z_{il}$ satisfies $z_{il} \in \{0, 1\}$ and $\sum_{l=1}^{L} z_{il} = 1$. When the data point $x_i$ originates from component l, $z_{il} = 1$, otherwise $z_{il} = 0$. Using the known parameters π, μ and Σ, Bishop (2006) proved the following relationship between $z_l = 1$ and $\pi_l$:

$$p(z_l = 1) = \pi_l, \quad (l = 1, 2, \ldots, L) \tag{4}$$

where $p(z_l = 1)$ is the prior probability of $z_l = 1$, and $\pi_l$ is the lth mixing coefficient.

According to the Bayesian theorem, we can use the known parameters π, μ and Σ to express the conditional probability of $z_l = 1$ given $x_i$ as follows:

$$p(z_l = 1 \mid x_i) = \frac{\pi_l \, g(x_i \mid \mu_l, \Sigma_l)}{\sum_{k=1}^{L} \pi_k \, g(x_i \mid \mu_k, \Sigma_k)}, \quad (l = 1, 2, \ldots, L;\ i = 1, 2, \ldots, n) \tag{5}$$

where $p(z_l = 1 \mid x_i)$ is the posterior probability of $z_l = 1$ given $x_i$. It can be viewed as the responsibility that the lth component takes for ‘explaining’ $x_i$ (Bishop, 2006).

The n × L responsibilities can be computed by Eq. (5) and used to solve the maximum likelihood solution for the parameters π, μ and Σ. According to Bishop (2006), the effective number of data points drawn from component l and the lth mixing coefficient can be, respectively, calculated by

$$n_l = \sum_{i=1}^{n} p(z_l = 1 \mid x_i) \tag{6}$$

and

$$\pi_l = \frac{1}{n} \sum_{i=1}^{n} p(z_l = 1 \mid x_i), \quad (l = 1, 2, \ldots, L) \tag{7}$$

where $n_l$ is the effective number of data points drawn from component l; $\pi_l$ is the lth mixing coefficient; and $p(z_l = 1 \mid x_i)$ is the responsibility that can be computed by Eq. (5).

Bishop (2006) showed that by setting the derivatives of the logarithmic likelihood in Eq. (3) with respect to the means $\mu_l$ to zero, the following maximum likelihood solution for $\mu_l$ is obtained:

$$\mu_l = \frac{1}{n_l} \sum_{i=1}^{n} p(z_l = 1 \mid x_i) \, x_i, \quad (l = 1, 2, \ldots, L) \tag{8}$$

where $p(z_l = 1 \mid x_i)$ and $n_l$ can be, respectively, calculated by Eqs. (5) and (6).

Bishop (2006) proved that by setting the derivative of the logarithmic likelihood in Eq. (3) with respect to $\Sigma_l$ to zero, and by making use of the result for the maximum likelihood solution for the covariance matrix of a single Gaussian, the following maximum likelihood solution for $\Sigma_l$ is obtained:

$$\Sigma_l = \frac{1}{n_l} \sum_{i=1}^{n} p(z_l = 1 \mid x_i) \, (x_i - \mu_l)(x_i - \mu_l)^T, \quad (l = 1, 2, \ldots, L) \tag{9}$$

where $p(z_l = 1 \mid x_i)$ and $n_l$ can be, respectively, calculated by Eqs. (5) and (6); and $\mu_l$ can be calculated by Eq. (8).

The EM algorithm for determining the GMM parameters starts with randomly initialized parameter values of π, μ, Σ. First, estimate responsibilities of the L latent variables using the current parameter values of π, μ, Σ, and then seek the maximum likelihood solution for parameters corresponding to the latent variables. Keep alternating until the resulting values converge to fixed points. The pseudo-code for this iteration is outlined in Table 1.

Based on the output values of π, μ, Σ, the probability $p(x_i)$ of each data point $x_i$, (i = 1, 2, …, n), can be calculated by Eq. (2). This probability can be viewed as the degree to which the data point $x_i$ conforms to the GMM. Based on $p(x_i)$, the anomaly degree of data point $x_i$ is defined as

$$s(x_i) = \max_{1 \le k \le n} \{\ln p(x_k)\} - \ln p(x_i), \quad (i = 1, 2, \ldots, n) \tag{10}$$

where $s(x_i)$ is the anomaly degree of data point $x_i$, (i = 1, 2, …, n). It has a non-negative value and is negatively correlated with the logarithmic probability $\ln p(x_i)$. It can be regarded as the degree to which the data point $x_i$ does not conform to the model. The larger the value of $s(x_i)$, the more likely it is that $x_i$ is an anomaly data point.

For geochemical anomaly detection, GMM is first used to model the standardized geochemical data to represent the background, and then Eq. (10) is used to compute $s(x_i)$ for each data point $x_i$. A threshold is eventually used to identify anomaly data points that clearly do not conform to the GMM model. If there are some known mineral deposits in the study area, the Youden index can be used to determine the “best” threshold (BT) (Chen, 2015; Chen and Wu, 2016, 2017a, b, c). How to determine BT is discussed in Section 5.2.

3. Study area and geochemical data

The study area is located in the Baishan district, Jilin Province, China. It covers four geological maps of 1:200,000 scale. The stream sediment survey data were collected from about 26,500 km² in the study area.


Table 1
Pseudo-code for the parameter estimation of GMM using the EM algorithm.

3.1. Geology and polymetallic mineralization

The study area is in the north margin of the north China platform and has experienced a complex geological evolution (Liu et al., 2000; Wu et al., 2005; Qin et al., 2014). Widely exposed geological formations include the Archean plutonic rocks, the Paleoproterozoic metamorphic rocks, the Neoproterozoic-Paleozoic sedimentary rocks, and the Mesozoic volcanic and volcanic-sedimentary rocks as well as the Paleoproterozoic gneiss granites and the Mesozoic granites (Fig. 1). The Ji'an-Songjiang tectonic belt traverses the whole study area and controls the spatial distribution of geological formations (Zhao et al., 1993; Zhang et al., 2006).

The Paleoproterozoic metamorphic formations provided fundamental substances for polymetallic mineralization (Wu et al., 1992; Zhang et al., 2011; Zhong et al., 2014). The Mesozoic granites and granite porphyries provided heat sources for the polymetallic mineralization and some mineralization substances (Zheng, 1995; Liu et al., 2000; Wu et al., 2005). Along the Ji'an-Songjiang tectonic belt and around the Mesozoic granites and granite porphyries, about 30 hydrothermal deposits have been discovered (Yang et al., 1999; Liu et al., 2009; Li et al., 2010).

3.2. Geochemical survey data

The stream sediment survey data come from China's National Geochemical Mapping Project (Xie et al., 1997). The sampling density was 1 sample per 4 km², and six major and 29 minor and trace elements were, respectively, analyzed by AAS and ICP-AES. There were 6608 stream sediment samples used for this study. The stream sediment survey data of each element were transformed into a 100 × 150 grid map using the interpolating method of Inverse Distance to a Power in Golden Software Surfer, version 12. Each grid point in the map represents a unit cell of 1.2167 × 1.6365 km². This cell size ensures that no more than one mineral deposit is present in the cell represented by any given grid point. The stream sediment survey data of the 35 elements were transformed into the 35 grid maps.

The AUCs and ZAUCs were computed based on each of the 35 grid maps and the known mineral deposit locations in the study area. The 30 known mineral deposit locations in the study area were used as “the ground truth data” for defining the true positive and true negative points. If a unit cell represented by a grid point contains a mineral deposit, the grid point is called a true positive point; otherwise it is called a true negative point. Suppose that there are p true positive points and q true negative points in the study area. According to the Wilcoxon test of ranks (Bergmann et al., 2000; Chen, 2015; Chen and Wu, 2016), the AUC of an element can be calculated by

$$AUC = \frac{1}{pq} \sum_{i=1}^{p} \sum_{j=1}^{q} \varphi(x_i, y_j) \tag{11}$$

with

$$\varphi(x_i, y_j) = \begin{cases} 1, & x_i > y_j \\ 0.5, & x_i = y_j \\ 0, & x_i < y_j \end{cases}$$

where $x_i$ (i = 1, 2, …, p) represents the concentration value of the element at the ith true positive point and $y_j$ (j = 1, 2, …, q) represents the concentration value of the element at the jth true negative point. According to Chen (2015), ZAUC can be written as

$$Z_{AUC} = \frac{AUC - 0.5}{S_{AUC}} \tag{12}$$

with

$$S_{AUC} = \sqrt{\frac{AUC(1 - AUC) + (p - 1)\left(\dfrac{AUC}{2 - AUC} - AUC^2\right) + (q - 1)\left(\dfrac{2AUC^2}{1 + AUC} - AUC^2\right)}{pq}}$$

where $S_{AUC}$ is the standard deviation of AUC.
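Eqs. (11) and (12) translate directly into a few lines of standard-library Python. This is a sketch for illustration only: the function names `auc_wilcoxon` and `z_auc` are hypothetical, and the double loop is the literal pairwise form of Eq. (11) rather than an efficient rank-based implementation.

```python
import math


def auc_wilcoxon(positives, negatives):
    """Eq. (11): mean of phi(x_i, y_j) over all positive/negative pairs."""
    total = 0.0
    for x in positives:
        for y in negatives:
            if x > y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(positives) * len(negatives))


def z_auc(auc, p, q):
    """Eq. (12): return (Z_AUC, S_AUC) for p positive and q negative points."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (p - 1) * (q1 - auc ** 2)
                + (q - 1) * (q2 - auc ** 2)) / (p * q)
    s_auc = math.sqrt(variance)
    return (auc - 0.5) / s_auc, s_auc


# Toy example: two positive and two negative values.
print(auc_wilcoxon([3, 2], [1, 2]))  # 0.875
```

As a rough check, assuming p = 30 and q = 14,970 (a 100 × 150 grid minus the 30 deposit cells; this split is inferred from the text, not stated in it), `z_auc(0.851, 30, 14970)` gives a standard deviation of about 0.044.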


Fig. 1. Simplified geologic map and discovered mineral deposits (Chen and Wu, 2016).

The AUC reflects the spatial relationship between element concentrations and mineral deposit locations. A value of AUC is in the range of 0.5–1, which corresponds respectively to random and deterministic relationships between element concentrations and mineral deposit locations. The ZAUC is a normally distributed statistic for testing whether an AUC is significantly different from 0.5. If the estimated value of ZAUC is greater than the critical value of 1.96 at the significance level of 0.05, the spatial relationship between element concentrations and mineral deposit locations is considered to be significant. Table 2 lists the estimated AUCs and ZAUCs of 35 elements. It shows that Au, Ag, As, B, Bi, CaO, Cd, Cu, Hg, MgO, P, Pb, Sb, Sn, V, W, and Zn are effective candidate metallogenic indicators of hydrothermal deposits in the study area because their ZAUCs are greater than the critical value of 1.96.

Through the analysis of the above statistical results and the analysis of the geological and mineralization characteristics of the study area, the metallogenic indicators were selected from the 35 elements. The Paleoproterozoic formations have genetic relationships with the polymetallic mineralization (Wu et al., 1992; Yang et al., 1999, 2001; Liu et al., 2009; Li et al., 2010). The high-metamorphic rocks are rich in boron and the low-metamorphic rocks are rich in CaO and MgO. Accordingly, boron, CaO, and MgO were chosen as metallogenic indicators. According to the characteristics of the known hydrothermal mineral deposits in the study area, Au, Ag, Cu, Pb, Zn, and Co are the primary metallogenic elements; and As, Sb, Bi, and Hg are the associated metallogenic elements. Hence, gold, Ag, Cu, Pb, Zn, Co, As, Sb, Bi, and Hg were chosen as metallogenic indicators. Finally, gold, Ag, As, B, Bi, CaO, Co, Cu, Hg, MgO, Pb, Sb, and Zn were chosen as the metallogenic indicators.

4. Geochemical data modeling

GMM and OCSVM were used to model the standardized stream sediment survey data. The Python codes from scikit-learn were used for GMM and OCSVM modeling; and the Python codes developed by Yongliang Chen were used for data input and output as well as performance evaluation.
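The GMM side of this workflow can be sketched with scikit-learn as follows. The text states only that scikit-learn was used, so the specific classes (`StandardScaler`, `GaussianMixture`), the full covariance type, the fixed `random_state`, and the function name `gmm_anomaly_degree` are assumptions made for this illustration, not the authors' program.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def gmm_anomaly_degree(X, n_components=4, seed=0):
    """Fit a GMM to standardized data and return s(x_i) of Eq. (10)."""
    # Eq. (1): zero mean and unit standard deviation per element (column).
    X_std = StandardScaler().fit_transform(X)
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",
                          random_state=seed).fit(X_std)
    # score_samples returns ln p(x_i) under the fitted mixture, Eq. (2).
    log_p = gmm.score_samples(X_std)
    # Eq. (10): distance of each point from the best-fitting point.
    return log_p.max() - log_p
```

The returned vector is non-negative, and its minimum is exactly zero at the data point with the highest fitted density. Rerunning with a different `seed` may give different results for the same number of components, since EM only reaches a local maximum.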

Table 2
AUCs and ZAUCs for 35 elements (Chen and Wu, 2016).

Element  AUC    ZAUC      Element  AUC    ZAUC      Element  AUC    ZAUC
Ag       0.741   4.616    Cu       0.801   6.190    Pb       0.718   4.095
Al2O3    0.567   1.231    Fe2O3    0.558   1.073    Sb       0.720   4.141
As       0.717   4.074    Hg       0.682   3.365    SiO2     0.385  −2.443
Au       0.773   5.412    La       0.484  −0.306    Sn       0.667   3.071
B        0.626   2.301    Li       0.590   1.651    Sr       0.555   1.019
Ba       0.491  −0.179    MgO      0.817   6.692    Ti       0.427  −1.477
Be       0.312  −4.560    Mn       0.501   0.011    V        0.683   3.373
Bi       0.742   4.629    Mo       0.498  −0.037    W        0.675   3.222
CaO      0.785   5.735    Na2O     0.407  −1.935    Y        0.386  −2.426
Cd       0.798   6.109    Nb       0.298  −5.035    Zn       0.670   3.120
Co       0.527   0.497    Ni       0.605   1.911    Zr       0.239  −7.630
Cr       0.556   1.029    P        0.697   3.663


Table 3
AUCs and PAAs for all L values in the GMM modeling.

      First modeling     Second modeling    Third modeling
L     AUC    PAA (%)     AUC    PAA (%)     AUC    PAA (%)
1     0.818   8.36       0.818   8.36       0.818   8.36
2     0.845  16.19       0.846  14.93       0.846  14.92
3     0.850  14.49       0.850  14.49       0.850  14.49
4     0.851  14.46       0.851  14.46       0.851  14.20
5     0.843  15.05       0.843  15.05       0.843  15.05
6     0.856  42.10       0.834  39.55       0.841  36.75
7     0.857  36.67       0.858  33.43       0.861  47.15
8     0.851  23.26       0.841  36.41       0.851  35.39
9     0.852  37.84       0.851  45.22       0.847  40.05
10    0.855  37.63       0.855  39.81       0.846  45.39

Table 4
AUCs for all pairs of σs and vs in the OCSVM modeling (rows: σ; columns: v).

σ \ v  0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.1    0.745  0.779  0.799  0.808  0.812  0.815  0.819  0.821  0.822
0.2    0.763  0.779  0.801  0.812  0.821  0.826  0.829  0.830  0.830
0.3    0.767  0.786  0.807  0.821  0.830  0.835  0.836  0.836  0.835
0.4    0.777  0.791  0.813  0.828  0.837  0.843  0.843  0.842  0.840
0.5    0.782  0.799  0.819  0.834  0.843  0.848  0.850  0.846  0.844
0.6    0.789  0.808  0.824  0.838  0.847  0.852  0.855  0.851  0.848
0.7    0.780  0.817  0.827  0.842  0.851  0.855  0.858  0.856  0.851
0.8    0.809  0.826  0.832  0.847  0.854  0.858  0.860  0.860  0.855
0.9    0.817  0.831  0.838  0.850  0.856  0.860  0.862  0.862  0.859
1.0    0.817  0.832  0.846  0.852  0.857  0.861  0.863  0.864  0.862

Table 5
PAAs (%) for all pairs of σs and vs in the OCSVM modeling (rows: σ; columns: v).

σ \ v  0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.1    18.25  13.73  12.37  16.27  16.25  18.14  14.82  14.36  14.13
0.2     8.96  14.89  13.73  11.81  12.57  14.02  11.37  11.39  11.45
0.3    13.90  10.23  17.67  13.55  12.04  12.61  13.73  10.86  10.77
0.4    11.6   10.46  22.59  16.17  13.68  12.51  12.65  13.68  14.59
0.5    16.66  10.61  10.94  19.21  15.41  13.93  13.21  12.76  13.61
0.6    22.14  10.91   7.89  27.94  18.46  18.69  14.49  13.92  13.75
0.7    20.74  37.49   7.89  26.90  19.45  18.18  18.83  15.44  14.92
0.8    41.10  37.94  37.23  37.52  39.12  18.54  18.32  25.35  25.44
0.9    45.37  42.72  43.98  37.27  24.61  41.43  25.09  24.79  24.77
1.0    42.00  42.23  40.59  35.25  36.87  39.58  20.61  20.63  24.04

4.1. GMM modeling

The number of mixing components (L) needs to be defined for GMM. To test how the performance of GMM varies with the value of L, let the value of L start at 1 and increase by 1 at a time, all the way to 10. For each value of L, the geochemical data were modelled three times using the GMM, and the anomaly degree of each data point was calculated using Eq. (10). For each modeling result, the AUC value was calculated using Eq. (11), and the percentage of anomaly areas (PAA) was estimated based on the geochemical anomalies separated by the method described in Section 5.2. Table 3 lists the AUCs and PAAs estimated using the modeling results; and Fig. 2 shows the curves of AUC and PAA varying with L.

Fig. 2a shows that AUC increases as L increases and fluctuates when L ≥ 5. Fig. 2b shows that PAA is less than 0.162 when L < 5 while it jumps to more than 0.23 when L ≥ 5. According to this result, L = 4 is considered to be optimal for the GMM modeling. This result is also consistent with the geological characteristics because only five lithologic formations are widely-exposed in the study area. Therefore, the GMM modeling result at L = 4 was used to separate geochemical anomalies.

4.2. OCSVM modeling

An OCSVM can model geochemical data without any assumptions on the data distribution. The model seeks a subset in the input space to support the high-dimensional distribution of input data; and anomaly data points are those which are drawn from the high-dimensional distribution but located outside of the support subset. By referring to Chen and Wu (2017c), anomaly degree of data point x can be written as

$$f(x) = \sum_{i=1}^{n} \alpha_i \left[ K(x_i, x_j) - K(x_i, x) \right], \quad j \in [1, 2, \ldots, n] \tag{13}$$

where $f(x)$ is anomaly degree of data point x; $\alpha_i$, (i = 1, 2, …, n), is the Lagrange parameter; $K(\cdot, \cdot)$ is Gaussian kernel; and n is the number of data points.

The parameters σ and v need to be determined for OCSVM by trial and error. In this study, each pair of σ and v were selected respectively from the two sequences of {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} and {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, and used to initialize OCSVM. The initialized OCSVM was trained on all the data points, and then the anomaly degree of each data point was calculated by Eq. (13).
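The trial-and-error sweep over σ and v can be sketched with scikit-learn's `OneClassSVM`. Several points here are assumptions rather than facts from the text: that the paper's v corresponds to the `nu` parameter, that the Gaussian kernel has the form exp(−‖a − b‖² / (2σ²)) so that `gamma = 1 / (2σ²)`, and that the negative of `decision_function` plays the role of f(x) in Eq. (13) up to a positive scale factor. The function names are hypothetical.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def ocsvm_anomaly_degree(X_std, sigma=0.6, nu=0.7):
    """Train a one-class SVM on all points and return an anomaly degree."""
    # Assumed kernel form: exp(-||a - b||^2 / (2 sigma^2)).
    gamma = 1.0 / (2.0 * sigma ** 2)
    model = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X_std)
    # decision_function is positive inside the support region, so its
    # negative grows as a point moves further outside it.
    return -model.decision_function(X_std)


def grid_search(X_std, sigmas, nus):
    """Anomaly degrees for every (sigma, nu) pair of the sweep."""
    return {(s, v): ocsvm_anomaly_degree(X_std, s, v)
            for s in sigmas for v in nus}
```

Each entry of the returned dictionary can then be scored against the known deposit locations with the AUC of Eq. (11), giving one cell of a table such as Table 4.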

Fig. 2. Curves of AUC and PAA varying the number of mixing components.


Fig. 3. Curves of AUC and PAA varying with σ and v.
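The PAA values plotted in Figs. 2 and 3 depend on the “best” threshold (BT) of Section 5.2, which maximizes the Youden index (Benefit minus Cost). A minimal NumPy sketch of that selection follows. It assumes that a grid point counts as anomalous when its anomaly degree is greater than or equal to the threshold and that PAA is the percentage of grid points flagged; neither convention is spelled out in the text, and the function name `best_threshold` is hypothetical.

```python
import numpy as np


def best_threshold(score, is_deposit):
    """Return (BT, LYI, PAA) for anomaly degrees and ground-truth labels."""
    score = np.asarray(score, dtype=float)
    is_deposit = np.asarray(is_deposit, dtype=bool)
    best = (None, -np.inf, None)
    for t in np.unique(score):                   # every possible threshold
        predicted = score >= t                   # predicted positive points
        benefit = predicted[is_deposit].mean()   # share of deposits captured
        cost = predicted[~is_deposit].mean()     # share of non-deposit cells flagged
        youden = benefit - cost
        if youden > best[1]:
            paa = 100.0 * predicted.mean()       # percentage of anomaly area
            best = (t, youden, paa)
    return best


# Toy example: six grid points, two of which contain a deposit.
bt, lyi, paa = best_threshold([0.1, 0.2, 0.3, 0.9, 0.8, 0.95],
                              [False, False, False, True, True, False])
print(bt, lyi, paa)  # 0.8 0.75 50.0
```

The loop visits each distinct score once, so it is quadratic in the number of grid points; that is acceptable for a sketch, while a sorted cumulative count would be faster.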

Table 6
Statistics for the GMM and OCSVM modeling results.

Statistics    GMM (L = 4)   OCSVM (v = 0.7; σ = 0.6)
AUC           0.851         0.855
SAUC          0.0441        0.0436
ZAUC          7.953         8.132
LYI           0.557         0.556
BT            30.611        264.932
PAA (%)       14.46         14.49
Benefit (%)   83            70
PRT (s)       18.67         32.14

The AUC and PAA for each modeling result were estimated using the methods described in Section 4.1. Tables 4 and 5 list the AUCs and PAAs, respectively, for all pairs of σs and vs in the OCSVM modeling; and Fig. 3 shows the curves of AUC and PAA varying with σ and v.

Fig. 3a shows that AUC increases rapidly with increase of v and tends to be stable when v ≥ 0.7. Fig. 3b shows that PAA fluctuates with increase of v and tends to be stable when v ≥ 0.7. Fig. 3c shows that AUC increases with increase of σ and tends to be stable when σ ≥ 0.6. Fig. 3d shows that PAA remains stable with increase of σ and increases rapidly when σ ≥ 0.6. Based on these results, v = 0.7 and σ = 0.6 were considered to be optimal for the OCSVM modeling. Therefore, the OCSVM modeling result at v = 0.7 and σ = 0.6 was used to separate geochemical anomalies.

5. Results and discussion

The performances of GMM and OCSVM in geochemical anomaly detection were evaluated using ROC and AUC, and the data modeling efficiencies of GMM and OCSVM were evaluated using PRT. The BT for separating geochemical anomalies was determined by using the Youden index (Chen, 2015; Chen and Wu, 2016, 2017a, b, c).

5.1. Performance evaluation

If the GMM and OCSVM algorithms perform well in geochemical anomaly detection, their modeling results should be highly spatially associated with the known mineral deposit locations. These spatial relationships were evaluated using ROC and AUC (Chen, 2015; Chen and Wu, 2016, 2017a, b, c).

By replacing the element concentration value with the anomaly degree, the AUCs for GMM and OCSVM were estimated through Eq. (11); and the SAUCs and ZAUCs were then computed using Eq. (12) based


Fig. 4. Geochemical anomalies separated optimally from the modeling results of (a) GMM and (b) OCSVM.

on the AUCs. Table 6 lists the AUCs, SAUCs, ZAUCs and PRTs of GMM and OCSVM. These statistics reveal: (a) the GMM and OCSVM modeling results are significantly spatially associated with the known mineral deposit locations because the ZAUCs for the two methods (7.940 and 8.132) are much higher than the critical value of 1.96; (b) GMM and OCSVM perform similarly well in geochemical anomaly detection because their AUCs (0.851 and 0.855) are approximately equal; and (c) GMM is more efficient than OCSVM in data modeling because the PRT of GMM is 18.67 s while that of OCSVM is 32.14 s.

A threshold can classify a data point into anomaly (predicted positive points) and background (predicted negative points) based on the anomaly degree of the data point. The predicted positive and negative


points, as well as the true positive and true negative points defined in Section 3.2, were used to calculate the Benefit and Cost at the threshold. Benefit is defined as the percentage of true positive points that are correctly classified as positive points; and Cost is defined as the percentage of true negative points that are incorrectly classified as positive points (Chen and Wu, 2016). By changing threshold, the Cost and Benefit at each possible threshold are computed and the ROC curve is generated by plotting Benefit against Cost at all the thresholds. The closer the ROC curve is to the upper left corner of the ROC space, the better the data modeling method performs in anomaly detection. Fig. 5 shows that the ROC curves of GMM and OCSVM intersect each other in the ROC space, so GMM and OCSVM have similar performance in geochemical anomaly detection.

Fig. 5. ROC curves for GMM and OCSVM.

5.2. Anomaly separation

The BT is selected from all possible thresholds for each data modeling result. Each possible threshold is used to separate geochemical anomalies, and the Benefit and Cost at the threshold are then estimated and used to compute the Youden index. The Youden index is defined as Benefit minus Cost (Chen and Wu, 2016). It can be viewed as an expression of the spatial relationships between the separated geochemical anomalies and the known mineral deposit locations. Of all the possible thresholds, the BT is the threshold with the largest Youden index (LYI). In other words, compared with the geochemical anomalies separated by any other thresholds, those separated by the BT have the highest spatial relationship with the known mineral deposit locations.

The LYIs, BTs, PAAs, and Benefits of GMM and OCSVM are listed in Table 6. These statistics show that the geochemical anomalies separated by the GMM and OCSVM algorithms occupy, respectively, 14.46% and 14.49% of the study area, and contain, respectively, 83% and 70% of the known mineral deposits. Fig. 4 shows the geochemical anomalies separated optimally from the GMM and OCSVM modeling results. These anomalies are highly spatially associated with the known mineral deposit locations.

5.3. Discussion

Besides standardization, normalization is another commonly used data re-scaling method in geochemical anomaly detection. A normalized value in the data matrix X is expressed as

$$x'_{ij} = \frac{x_{ij} - x_{\min j}}{x_{\max j} - x_{\min j}}, \quad (j = 1, 2, \ldots, m) \tag{15}$$

where $x_{ij}$ and $x'_{ij}$ are, respectively, the observed and normalized values of element j on data point i; and $x_{\min j}$ and $x_{\max j}$ are, respectively, the minimum and maximum values of element j. The normalized data range between 0 and 1.

In order to test whether the data re-scaling methods affect the performance of GMM and OCSVM in geochemical anomaly detection, the geochemical survey data were normalized using Eq. (15) and then modelled using GMM and OCSVM.

The method used for modeling the standardized data in Section 4.1 was used to model the normalized geochemical data. The GMM modeling results of the normalized data were compared with those of the standardized data. Fig. 6a shows no significant difference in AUCs between the standardized and normalized data modeling results; and Fig. 6b shows no significant difference in PAAs between the standardized and normalized data modeling results. Therefore, the data re-scaling methods have little influence on the performance of GMM in geochemical anomaly detection.

The method used for modeling the standardized data in Section 4.2 was used to model the normalized geochemical data. The OCSVM modeling results of the normalized data were compared with those of the standardized data. Fig. 6c shows significant differences between AUCs for the standardized and normalized data modeling results; and Fig. 6d shows significant differences between PAAs for the standardized and normalized data modeling results. Therefore, the data re-scaling methods significantly affect the performance of OCSVM in geochemical anomaly detection.

GMM only needs to define a positive-integer-valued parameter L while OCSVM needs to define two positive-real-valued parameters σ and v by trial and error. Therefore, GMM is easier to implement than OCSVM in geochemical anomaly detection. The disadvantage of GMM is that repeated modeling results of the same geochemical data using the same L value are often different, because the EM algorithm cannot guarantee finding the global maximum (Wu, 1983). How to solve this problem needs further investigation.

As long as the parameter L is large enough, the GMM model can approximate almost any continuous population distribution. On the other hand, when the parameter L = 1, the GMM model degenerates into a multivariate Gaussian function. Therefore, the GMM model can be used as a universal anomaly detector to separate geochemical anomalies from the sample data of any continuous population distribution. In practice, we can use the GMM model with different L values (L = 1, 2, 3, …) to detect geochemical anomalies and select the model with the best performance. In this study, the GMM model with the parameter L = 4 performs best in geochemical anomaly detection. Thus, the study area has a complex geochemical background.

The geochemical anomalies delineated in Section 5.2 are mainly distributed in the southwest of the study area and some in the north of the study area. These geochemical anomalies spatially coincide with the Archean and Paleoproterozoic metamorphic rocks which provided the fundamental substances for regional mineralization of the study area (Wu et al., 1992; Zhang et al., 2011; Zhong et al., 2014). The distribution direction of the separated geochemical anomalies is consistent with the regional structure of the study area. More than 80 per cent of known mineral deposits are located in the separated geochemical anomalies. Therefore, the geochemical anomaly detection results are consistent with the geological and metallogenic features of the study area.

6. Conclusion

Gaussian mixture model was used to fit high dimensional geochemical data, and multivariate geochemical anomalies were separated

16
Y. Chen, W. Wu Computers and Geosciences 125 (2019) 9–18

Fig. 6. Curves of AUC and PAA of the standardized and normalized data modeling results.

from the complex geochemical background successfully. This study References

shows that the performance of Gaussian mixture model is similar to that
of one-class support vector machine, but the data modeling efficiency of Bergmann, R., Ludbrook, J., Spooren, W.P.J.M., 2000. Different outcomes of the
Gaussian mixture model is higher than that of one-class support vector Wilcoxon-Mann-Whitney test from different statistics packages. Am. Statistician 54
(1), 72–77.
machine. Therefore, Gaussian mixture model is a potentially useful Bishop, M.C., 2006. Pattern Recognition and Machine Learning. Springer, pp. 738pp.
anomaly detection method with high performance and data modeling Chen, Y.L., 2015. Mineral potential mapping with a restricted Boltzmann machine. Ore
efficiency. The geochemical data modeling results of Gaussian mixture Geol. Rev. 71, 749–760.
Chen, Y.L., Wu, W., 2016. A prospecting cost-benefit strategy for mineral potential
model and one-class support vector machine are strongly consistent mapping based on ROC curve analysis. Ore Geol. Rev. 74, 26–38.
with the metallogenic characteristics of the study area. The separated Chen, Y.L., Wu, W., 2017a. Mapping mineral prospectivity using an extreme learning
geochemical anomalies have highly spatial relationships with the machine regression. Ore Geol. Rev. 80, 200–213.
Chen, Y.L., Wu, W., 2017b. Mapping mineral prospectivity by using one-class support
known mineral deposit locations in the study area.
vector machine to identify multivariate geological anomalies from digital geological
survey data. Aust. J. Earth Sci. 44 (5), 639–651.
Chen, Y.L., Wu, W., 2017c. Application of one-class support vector machine to quickly
identify multivariate anomalies from geochemical exploration data. Geochem.
Acknowledgements
Explor. Environ. Anal. 17, 231–238.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1997. Maximum likelihood from incomplete
This work was supported by the National Natural Science data via the EM algorithm. J. Roy. Stat. Soc. B 39 (1), 1–38.
Foundation of China (Grant nos. 41472299 and 41672322). Drews Jr., P., Núňez, P., Rocha, P.R., Campos, M., Dias, J., 2013. Novelty detection and
segmentation based on Gaussian mixture models: a case study in 3D robotic laser
mapping. Robot. Autonom. Syst. 61, 1696–1709.
Huang, Z.K., Chau, K.W., 2008. A new image thresholding method based on Gaussian
Appendix A. Supplementary data mixture model. Appl. Math. Comput. 205, 899–907.
Khanmohammadi, S., Chou, Chun-An, 2016. A Gaussian mixture model based dis-
cretization algorithm for associative classification of medical data. Expert Syst. Appl.
Supplementary data to this article can be found online at https:// 58, 119–129.
doi.org/10.1016/j.cageo.2019.01.010. Kim, S.C., Kang, T.J., 2007. Texture classification and segmentation using wavelet packet

17
Y. Chen, W. Wu Computers and Geosciences 125 (2019) 9–18

frame and Gaussian mixture model. Pattern Recogn. 40, 1207–1221. ([in Chinese]).
Lindstrom, M.J., Bates, D.M., 1988. Newton-Raphson and EM algorithms for linear mixed- Wu, F., Lin, J., Wilde, S.A., Zhang, Q., Yang, J., 2005. Nature and significance of early
effects models for repeated-measures data. J. Am. Stat. Assoc. 83 (404), 1014–1022. Cretaceous giant igneous event in eastern China. Earth Planet. Sci. Lett. 233,
Li, B., Yang, Z., Wang, Y., 2010. Geological characteristics and genesis of Huanggoushan 103–119.
and Banmiaozi gold deposits in Laoling metallogenic belt of southern Jilin. Glob. Xie, X., Mu, X., Ren, T., 1997. Geochemical mapping in China. J. Geochem. Explor. 60,
Geol. 29 (3), 392–399 ([in Chinese]). 99–113.
Li, L.H., Hansman, J.R., Palacios, R., Welsch, R., 2016. Anomaly detection via a Gaussian Yang, Y.C., Feng, B.Z., Liu, P.E., 2001. Dahenglu type of cobalt deposit in Laoling area,
mixture model for flight operation and safety monitoring. Transport. Res. Part C 64, Jilin Province——A sedex depost with late reformation. Journal of Changchun
45–57. University of Science and Technology 31 (1), 40–45 ([in Chinese]).
Liu, W., Deng, J., Chu, X.L., Zhai, Y.S., Xu, G.Z., Li, X.J., 2000. Characteristics and geo- Yang, Y.C., Ye, S.Q., Feng, B.Z., 1999. The Huanggoushan typed hot-water deposition and
logical background of formation of large and giant ore deposits within the northern superimposed reformation gold deposit in Laoling mineralization belt of South Jilin
margin of the north China platform. Prog. Geophys. 15 (2), 67–78 ([in Chinese]). Province. Gold 6, 1–4 ([in Chinese]).
Liu, W., Man, Y., Wang, X., 2009. Geology and genesis of the Jinying gold deposit in Jilin Zhang, G.R., Jiang, S., Han, X.P., Huang, Z.F., Qu, H.X., Guo, W.J., Wang, F.J., 2006. The
Province. Geol. Resour. 18 (4), 279–283 ([in Chinese]). main characteristics of Yalujiang fault zone and its significance. Geol. Resour. 15 (1),
Qin, Y., Chen, D.D., Liang, Y.H., Zou, C.M., Zhang, Q.W., Bai, L.A., 2014. Geochronology 11–19 ([in Chinese]).
of Ji’an Goup in Tonghua area, southern Jilin Province. Earth Sci. J. China Univ. Zhang, L.M., Wang, D.S., Zhang, D.W., 2011. Geologic characteristics, ore-controlling
Geosci. 39 (11), 1587–1599 ([in Chinese]). factors and prospects of the Gaoligou gold deposit in Jilin Province. Geol. Resour. 20,
Simms, M.L., Blair, B., Ruz, J., Wurtz, R., Kaplan, D.A., Glenn, A., 2018. Pulse dis- 350–353 ([in Chinese]).
crimination with a Gaussian mixture model on an FPGA. Nucl. Instrum. Methods Zhao, G.M., Gao, C.B., Chou, J.B., Li, Z.Y., 1993. Base structure and the Yalu River fault
Phys. Res. A 900, 1–7. zone in Dandong district. Acta Seismol. Sin. (Chin. Ed.) 15 (3), 282–288 ([in
van Dyk, D.A., 2000. Fitting mixed-effects models using efficient EM-type algorithms. J. Chinese]).
Comput. Graph Stat. 9 (1), 78–98. Zheng, C.J., 1995. The geological features and origin of the Huanggoushan gold deposit,
Wu, C.F.J., 1983. On the convergence properties of the EM algorithm. Ann. Stat. 11 (1), Jilin Province. Jilin Geol. 14 (3), 1–16 ([in Chinese]).
95–103. Zhong, G.J., Run, T.Y., Cai, Y., 2014. Geological features and origin of Cuocaogou gold
Wu, D.Y., Yang, Y., Song, Q., 1992. Strata bound characteristics of gold, lead and zinc deposit. Western Prospecting Engineering 3, 117–124 ([in Chinese]).
deposits in the Ji’an Group, southern part of Jilin Province. Jilin Geol. 11 (4), 8–16
