
Environ Geochem Health (2024) 46:482

https://doi.org/10.1007/s10653-024-02201-1

ORIGINAL PAPER

Identifying the spatial pattern and driving factors of nitrate in groundwater using a novel framework of interpretable stacking ensemble learning

Xuan Li · Guohua Liang · Lei Wang · Yuesuo Yang · Yuanyin Li ·
Zhongguo Li · Bin He · Guoli Wang

Received: 19 February 2024 / Accepted: 27 August 2024 / Published online: 29 October 2024
© Dalian University of Technology, the British Geological Survey (UKRI), Yuesuo Yang, Yuanyin Li, Zhongguo Li 2024

Abstract Groundwater nitrate contamination poses a potential threat to human health and environmental safety globally. This study proposes an interpretable stacking ensemble learning (SEL) framework for enhancing and interpreting groundwater nitrate spatial predictions by integrating the two-level heterogeneous SEL model and SHapley Additive exPlanations (SHAP). In the SEL model, five commonly used machine learning models were utilized as base models (gradient boosting decision tree, extreme gradient boosting, random forest, extremely randomized trees, and k-nearest neighbor), whose outputs were taken as input data for the meta-model. When applied to the agricultural intensive area, the Eden Valley in the UK, the SEL model outperformed the individual models in predictive performance and generalization ability. It reveals a mean groundwater nitrate level of 2.22 mg/L-N, with 2.46% of sandstone aquifers exceeding the drinking standard of 11.3 mg/L-N. Alarmingly, 8.74% of areas with high groundwater nitrate remain outside the designated nitrate vulnerable zones. Moreover, SHAP identified that transmissivity, baseflow index, hydraulic conductivity, the percentage of arable land, and the C:N ratio in the soil were the top five key driving factors of groundwater nitrate. With nitrate threatening groundwater globally, this study presents a high-accuracy, interpretable, and flexible modeling framework that enhances our understanding of the mechanisms behind groundwater nitrate contamination. It implies that the interpretable SEL framework has great promise for providing valuable evidence for environmental management, water resource protection, and sustainable development, particularly in the data-scarce area.

Keywords Water quality · Groundwater · Spatial distribution · Driving factors · Ensemble learning · Interpretable machine learning

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s10653-024-02201-1.

X. Li · G. Liang · B. He · G. Wang
School of Hydraulic Engineering, Dalian University of Technology, Dalian 116024, China

X. Li · L. Wang (*) · Y. Li
British Geological Survey, Keyworth, Nottingham NG12 5GG, UK
e-mail: [email protected]

Y. Yang
Key Laboratory of Groundwater Resources and Environment, Ministry of Education, Jilin University, Changchun 130021, China

Y. Li
Department of Geography, Durham University, Durham DH1 3LE, UK

Z. Li
Liaoning Water Affairs Service Center, Shenyang 110003, China


Introduction

Groundwater is a valuable resource, serving as the primary source of drinking water for over a third of the population in the world (IAHS, 2023). However, with the increasing human activities, excess nitrogen released into the subsurface environment causes groundwater nitrate contamination (Castaldo et al., 2021; Liu et al., 2023; Mahlknecht et al., 2023). It poses a threat to human health and environmental security, which has attracted global attention (Kaur et al., 2020; Knoll et al., 2019; Ransom et al., 2022). Nitrate ingestion by humans is related to methemoglobinemia, adverse pregnancy outcomes, thyroid disease, and specific cancers (Picetti et al., 2022; Richards et al., 2022). Due to the importance of protecting public health, the World Health Organization (WHO) set the guideline value of 50 mg/L NO3 (equivalent to 11.3 mg/L-N) for nitrate concentration in drinking water (WHO, 2022). Therefore, it is crucial to protect groundwater from nitrate pollution and limit nitrogen inputs. To achieve the goal, it is necessary to identify the spatial pattern and important influential factors of groundwater nitrate.

The Eden Valley is a largely rural area in the UK, and groundwater is widely used for public water supply, industry, and minor private supplies for farms (Butcher et al., 2003). Nevertheless, groundwater nitrate pollution is a serious problem in the study area, which is primarily caused by intensive farming practices (Wang & Burke, 2017). The extensive application of fertilizers and manure in arable land in the 1980s significantly increased nitrogen levels in the soil (Wang et al., 2012). Moreover, it is reported that atmospheric nitrogen deposition is recognized as an important nitrogen source for woodland soils in the UK (Vanguelova et al., 2024). Nitrogen can be converted into nitrate through nitrification and then leach into aquifers via infiltration, posing a severe threat to groundwater quality. Notably, in areas with a thick unsaturated zone in the Eden Valley, the peak nitrogen loading has not reached the groundwater table (Wang et al., 2013). To protect waters against nitrate pollution, the EU proposed Nitrates Directive 91/676/EEC in 1991, which requires the designation of certain areas as Nitrate Vulnerable Zones (NVZs) where nitrate in surface water or groundwater has exceeded or could exceed 50 mg/L nitrate (11.3 mg/L-N) due to agricultural sources, and deliver measures (EU, 1991; Musacchio et al., 2020). The recent Nitrate Vulnerable Zones (NVZs) designation in 2021 delineated four groundwater NVZs in the Eden Valley (EA, 2021). To address the groundwater nitrate pollution in the study area, it is crucial to investigate the spatial distribution of groundwater nitrate concentrations and gain a thorough understanding of the impacts of environmental variables.

Accurate groundwater quality spatial distribution is essential for comprehending current contaminant levels, particularly for the data-scarce area. However, conventional spatial interpolation methods typically depend on geographical information while neglecting the impacts of environmental factors (Mainali et al., 2019), which can result in potential high deviation and uncertainty in predictions. On the other hand, frequent water quality monitoring and testing is costly and time-consuming, and data availability is often delayed (Li et al., 2022). By contrast, machine learning (ML) is a new data-driven model that can identify the complex and non-linear relationship between input and target variables, which has developed rapidly in recent decades. With the advantages of high accuracy, low cost, and time-saving, ML has been increasingly applied in groundwater investigations and has shown promising results (Barzegar et al., 2021; Iqbal et al., 2023; Nadiri et al., 2023; Ransom et al., 2022).

Nevertheless, it is inevitable that individual ML models may selectively capture local patterns and be prone to noise or errors, which can lead to poor performance on unseen data. In addition, although ML has shown promise in predicting variables, its complex structure, like an intelligent black-box, presents challenges in understanding the mechanisms (Nearing et al., 2021), such as support vector regression (SVR) with a non-linear kernel and artificial neural network (ANN) with multiple hidden layers, in particular for the ensemble learning model within a multi-layer structure. Otherwise, ranking the features through multiple transformations is essentially meaningless. Tree-based models, like extreme gradient boosting (XGB) and random forest (RF), enable interpretability of the model; whereas, their explanations are limited to the training data, and XGB can only offer the global explanation. This hinders water managers from leveraging machine learning predictions to formulate targeted safeguard policies.


To tackle the dual challenge of predictive performance and interpretability, combining stacking and the interpretable method offers a potential solution. Stacking ensemble learning (SEL) is a powerful ensemble learning method, and it can enhance overall prediction accuracy by integrating the outputs of multiple base models to obtain the final prediction based on the “wisdom of crowds” (Wang et al., 2021). To decrease the risk of overfitting, it is commonly coupled with cross-validation (CV) to generate new training data for the meta-model. The SEL model exhibits great promise of applications in many fields, e.g., hydrology (Lu et al., 2023; Shams et al., 2021), meteorology (Gu et al., 2022; Morshed-Bozorgdel et al., 2022), and environment (Sakizadeh et al., 2024; Wang et al., 2021). Given its superior model performance and generalization in previous studies, the SEL model is required to be introduced to accurately predict groundwater contamination, especially in the data-scarce area. On the other hand, SHapley Additive exPlanations (SHAP) is an advanced interpretable method that can not only provide global explanations and feature importance but also explain an individual prediction (Lundberg et al.; Lundberg & Lee, 2017). It can also identify the positive and negative effects on predictive results, as well as linear and nonlinear relationships. Thus, SHAP is a valuable tool in enhancing model transparency and interpretability, facilitating a deeper insight into the ML model (Li et al., 2022). However, it is rarely used in groundwater pollution research.

In this study, we adopt a two-level heterogeneous SEL model, consisting of five base models at level 0 (gradient boosting decision tree (GBDT), XGB, RF, extremely randomized trees (ET), and k-nearest neighbor (KNN)), and a meta-model at level 1 (KNN) that uses the output from the base models. SHAP is employed to identify important driving factors and quantify their contributions. To our knowledge, the SEL model combined with the interpretable ML method has not been used to analyze contaminants in water before, and this study attempts to fill this gap. The main objectives of this study are to (1) develop a novel two-level interpretable stacking ensemble learning (ISEL) framework for analyzing groundwater nitrate; (2) compare the model performance and generalization ability of the SEL model to five individual ML models; (3) map the spatial distribution of nitrate in groundwater and pinpoint high nitrate areas in the Eden Valley, UK; and (4) identify key driving factors of groundwater nitrate and quantitatively analyze their influence.

Data and method

Study area

The Eden Valley is located in Cumbria, North West England, covering approximately 2308 km² (Fig. 1). The River Eden originates in the Pennines and discharges into the Solway Firth in the northwest, running northwards and joined by tributary rivers, such as the River Eamont, the River Irthing, and the River Caldew. The meteorological, hydrology, and hydrogeology conditions in the Eden Valley are shown in Fig. S1. In the study area, the elevation varies from 945 m to sea level, which is relatively high in the southwest and the east but low in the valley. It has a temperate marine climate, with an average annual precipitation of approximately 1000 mm/a in the study area and exceeding 1500 mm/a on higher ground (Butcher et al., 2003). The population density of the Eden Valley is as low as about 0.2 person/ha, lower than most districts in England. The major sources of income are agriculture, especially livestock rearing, tourism, and some industries (Butcher et al., 2003).

In the Eden Valley, the Permo-Triassic rocks lie in a fault-bounded basin bounded southwest by the Lake District and northeast by the North Pennines. As shown in Fig. 1, the principal aquifers in this region are the Penrith Sandstones and St Bees Sandstones, which are thick sequences of Permo-Triassic sandstones with moderate to high permeability and porosity. These sandstones are separated by the Eden Shale, an aquitard mainly composed of mudstone and siltstone. In the study area, approximately 75% of the sandstone aquifers are covered by superficial deposits, significantly impacting recharge and distribution (Allen et al., 2010). Hydraulic conductivity (K) ranges from 3.5 × 10⁻⁵ to 26.2 m/day for the Penrith Sandstones and from 0.048 to 3.5 m/day for St Bees Sandstones. The wide range is primarily due to the varying degree of cementation of the sandstone (Allen et al., 1997). Carboniferous limestone is mainly located on the edges of the study area, characterized by very low porosity and permeability. They
provide base flow for the streams and tributaries of the catchment subregion of the River Eden.

The Eden Valley is largely rural and mainly covered by grassland, mountains, and arable land. It is a notable concern that intensive farming activities, including fertilizers and manure slurry applications, lead to groundwater nitrate contamination. According to the recent Nitrate Vulnerable Zones (NVZs) designation in 2021, there are four groundwater NVZs in the Eden Valley (EA, 2021), i.e., the Brampton Sand Sheet, Penrith, Skirwith, and Kirby Thore NVZs. Therefore, it is necessary to understand the nitrate contamination level in groundwater and analyze its key driving factors to tackle the nitrate challenge in the Eden Valley.

Fig. 1 Lithology, well locations, groundwater nitrate concentrations, and NVZs in the study area


Nitrate concentration data

Groundwater nitrate concentration data were collected from the Water Quality Archive (Beta), which was carried out by the EA (EA, 2012). In the Eden Valley, there are 1107 groundwater nitrate concentration measurements between 2012 and 2021 from 74 monitoring wells, whose locations are shown in Fig. 1. Of these nitrate values, 10.66% were below the method detection limit (0.196 mg/L-N), and they were set to half the limit (0.098 mg/L-N). For wells with multiple nitrate measurements in one year, the annual mean value was calculated to represent the average nitrate level in that year. Ultimately, 549 nitrate concentration data between 2012 and 2021 were used for training and testing the predictive model. In addition, to decrease the impact of very high values, nitrate concentrations were log10 transformed before modeling. The log10 transformed values represented the response variable for the machine learning models, and the predictions were then converted back to nitrate concentrations after modeling. Nitrate values in this study represent nitrate nitrogen, with the unit expressed as mg/L-N.

Predictor variables and feature engineering

We compiled a set of 26 predictor variables that represented climate, hydrology, soils, geology, hydrogeology, and land use, as listed in Table S1. Superficial depth data was from British Geological Survey (BGS, 2020). Soil physical and chemical characteristics were obtained from the European Soil Data Centre (ESDAC) (Ballabio et al., 2016, 2019). The dataset of precipitation and evaporation was from the UK Met Office (Met Office et al., 2018). Furthermore, the baseflow index (BFI) (Boorman et al., 1995) and land use (Morton et al., 2014) were collected from the UK Centre for Ecology and Hydrology (CEH). In the Eden Valley, the main land use was grassland (58.90%), woodland (9.98%), arable land (9.71%), built-up areas (1.98%), and mountain (18.64%), respectively. The former four land use types were used to analyze the impacts on the groundwater nitrate in this study, and the contributing area was calculated within a 500 m radius circular buffer (Ransom et al., 2022). Moreover, some variables were obtained from the previous study (Wang & Burke, 2017), including elevation, groundwater average recharge, unsaturated zone thickness, and aquifer properties. Then, all of the environmental variables at the well locations and the center of each element in the grid map of the Eden Valley (200 m × 200 m), except for land use, were extracted as point data using ArcGIS.

To reduce multicollinearity in the dataset, prevent overfitting and enhance explanation, the Pearson correlation coefficient (r) between the environmental variables was calculated, as illustrated in the heatmap of correlation matrix (Fig. 2). Based on the absolute value of r exceeding 0.70, four highly correlated variables exhibiting a higher average absolute value of r with other variables were removed (Kuhn & Johnson, 2013), including precipitation minus evaporation, nitrogen fertilizer application rates, nitrogen in the soil, and available water capacity. Despite the average absolute correlation of the percentage of built-up area being greater than that of population, the great concern about the effects of land use on groundwater pollution led to the exclusion of the population. Similarly, soil sand percentage and DEM were also retained, which are essential variables in nitrate predictions in previous research (Wheeler et al., 2015; Nolan et al., 2014). Eventually, 21 environmental variables were selected as input features for the ML models.

In addition, normalization was applied to ensure that each feature contributes equally to the result. It can help decrease the training time and improve the model performance. In this study, all the predictor variables were normalized to the range of 0 to 1 through min–max normalization before being utilized as inputs, as Eq. (1):

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$

where X′ represents the normalized value; X is the original value, and X_max and X_min are the maximum and minimum of the original data, respectively.

Interpretable stacking ensemble learning (ISEL) framework

To improve the model performance and generalization and interpret the predictive model, we designed an ISEL framework, as shown in Fig. 3. The ISEL framework for groundwater nitrate mapping consists of four steps: (1) data pre-processing; (2)
hyperparameter tuning and model performance evaluation; (3) creation of groundwater nitrate distribution map; and (4) key driving factors identification and quantitative analysis.

Fig. 2 The heatmap of Pearson correlation matrix

Stacking ensemble learning (SEL)

Stacking, also known as a stacked generalization, is a powerful ensemble learning technique in machine learning. It aims to improve predictive performance by relying on the “wisdom of the crowds”. The main idea of stacking is to extract more information from the base models, capture more complex patterns, and reduce the variance and bias of the individual models by integrating the predictions of multiple models. As a result, the SEL model typically performed better than the individual models because of the model diversity, bias reduction, and enhanced robustness. In the SEL model, the models in the first layer are trained on the original dataset, while the models in subsequent layers are trained on the outputs of the previous layer, as illustrated in Fig. 4.

In this study, we employed a two-level SEL model, consisting of five base models (GBDT, XGB, RF, ET,
KNN) and a meta-model that uses the outputs from the base models. These models were selected because they are based on different theories and structures, are widely used, and have demonstrated high accuracy in previous studies. Moreover, the tenfold CV generator was applied in the training phase to improve model generalization capability. As shown in Fig. 4, the training data was divided into ten folds randomly; nine folds (in light grey) were used for training the models and one remaining fold (in dark blue) was reserved for validation in each iteration. By repeating this process ten times, we obtained ten predictive validation sets, which were then combined to form a new feature set for training the meta-model. Furthermore, at level 1, the average predictions (in orange) for the testing data from each iteration (in dark green) were used as a feature of new testing data for the meta-model. Consequently, the five base models provided five columns of new features as new training data and testing data for the meta-model. Finally, we can tune and fit the meta-model using new training data and evaluate model performance using new testing data.

Fig. 3 The framework of interpretable stacking ensemble learning (ISEL) for identifying the spatial distribution and driving factors of nitrate in groundwater
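
As a hedged illustration of step (1) of the ISEL framework (data pre-processing), the sketch below applies the rules stated in the text: values below the 0.196 mg/L-N detection limit are set to half the limit, multiple measurements per well and year are averaged, the annual means are log10 transformed, and predictors are scaled with Eq. (1). It uses only the Python standard library. The function names, the record layout, the handling of constant columns, and the sample values are assumptions made for illustration and are not taken from the study's code.

```python
import math
from collections import defaultdict

DETECTION_LIMIT = 0.196  # mg/L-N, method detection limit reported in the text


def annual_log10_means(records):
    """records: iterable of (well_id, year, nitrate_mg_l_n); None marks a non-detect.

    Returns {(well_id, year): log10 of the annual mean concentration}.
    """
    groups = defaultdict(list)
    for well, year, value in records:
        # Values below the detection limit are set to half the limit (0.098 mg/L-N).
        if value is None or value < DETECTION_LIMIT:
            value = DETECTION_LIMIT / 2
        groups[(well, year)].append(value)
    return {key: math.log10(sum(vals) / len(vals)) for key, vals in groups.items()}


def min_max_normalize(column):
    """Eq. (1): scale one predictor column to the range 0 to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column is mapped to 0 to avoid division by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def back_transform(log_value):
    """Convert a log10-scale value back to mg/L-N."""
    return 10 ** log_value


# Hypothetical records for two wells (illustrative values, not from the study).
records = [
    ("W1", 2012, 4.0),
    ("W1", 2012, 6.0),   # two samples in one year, averaged to 5.0
    ("W2", 2012, None),  # non-detect, replaced by 0.098
]
targets = annual_log10_means(records)
scaled = min_max_normalize([120.0, 300.0, 945.0])
```

Averaging before the log10 transform follows the order in which the text describes the two operations; averaging in log space would give a geometric rather than an arithmetic annual mean.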


Fig. 4  The workflow of the stacking ensemble learning (SEL) model
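
The workflow in Fig. 4 can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: two k-nearest-neighbour regressors and a least-squares line stand in for the five base models, folds are assigned deterministically rather than randomly, and all function names and the toy data are hypothetical. Only the overall scheme follows the text: out-of-fold predictions become the meta-model's training features, and fold-averaged predictions become its testing features.

```python
def fit_knn(k):
    """Return a fit function for a k-nearest-neighbour regressor."""
    def fit(X, y):
        def predict(x):
            order = sorted(
                range(len(X)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
            )
            nearest = order[:k]
            return sum(y[i] for i in nearest) / len(nearest)
        return predict
    return fit


def fit_line(X, y):
    """Least-squares line on the first feature (stand-in for another model family)."""
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, y)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x[0] - mx)


def stack_features(X_train, y_train, X_test, learners, n_folds=10):
    """Level-0 step: out-of-fold predictions for training rows and
    fold-averaged predictions for testing rows, one column per learner."""
    n = len(X_train)
    oof = [[0.0] * len(learners) for _ in range(n)]
    test_meta = [[0.0] * len(learners) for _ in X_test]
    for fold in range(n_folds):
        # Deterministic fold assignment (the paper divides the data randomly).
        val_idx = [i for i in range(n) if i % n_folds == fold]
        fit_idx = [i for i in range(n) if i % n_folds != fold]
        X_fit = [X_train[i] for i in fit_idx]
        y_fit = [y_train[i] for i in fit_idx]
        for j, fit in enumerate(learners):
            predict = fit(X_fit, y_fit)
            for i in val_idx:
                oof[i][j] = predict(X_train[i])  # prediction on the held-out fold
            for t, x in enumerate(X_test):
                test_meta[t][j] += predict(x) / n_folds  # average over folds
    return oof, test_meta


# Toy data: y = 2 * x on a single feature (illustrative only).
X_train = [[float(i)] for i in range(20)]
y_train = [2.0 * row[0] for row in X_train]
X_test = [[4.5], [12.5]]

learners = [fit_knn(1), fit_knn(3), fit_line]
oof, test_meta = stack_features(X_train, y_train, X_test, learners, n_folds=10)

# Level 1: a KNN meta-model trained on the out-of-fold columns.
meta_predict = fit_knn(3)(oof, y_train)
stacked_predictions = [meta_predict(row) for row in test_meta]
```

Because every training row is predicted by models that never saw it, the meta-model is fitted on features that resemble what it will receive for unseen data, which is the overfitting safeguard the text attributes to coupling stacking with cross-validation.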

To implement the methodology, we used the Scikit-Learn library (Pedregosa et al., 2011) in Python 3.7 (Van Rossum & Drake, 2009) for GBDT, RF, ET, KNN, and SEL. For the XGB model, the XGBoost package in Python (Chen & Guestrin, 2016) was applied.

Hyperparameter tuning

Following the commonly utilized 8:2 dataset splitting ratio (Joseph, 2022), ML models were developed using the training data from the first eight years (n = 472, 2012–2019), and the model performance was evaluated with the independent testing data from the subsequent two years (n = 77, 2020–2021). During model tuning, the optimal combination of hyperparameters was selected using the Tree-structured Parzen Estimator (TPE) algorithm (Bergstra et al., 2011) combined with the tenfold CV. TPE algorithm, a Bayesian optimization approach based on Gaussian mixture models, runs faster and performs more efficiently than Gaussian process models. It was conducted using the Python package Hyperopt (Bergstra et al., 2015). The initial range for the hyperparameter to be optimized was assigned according to relevant articles and documents, and the model was trained 1000 times to select the optimal combination of hyperparameters using the TPE algorithm. Moreover, tenfold CV technique was performed on the training data during model tuning to control model overfitting and enhance model generalizability.

After determining the optimal combination of hyperparameters, the whole training data was utilized to refit the CV-tuned model, and the testing data was then used to predict and compare model performance. Therefore, nitrate spatial predictions can be produced based on the 21 predictor variables and the CV-tuned model using Python. Finally, model predictions for mapping the nitrate spatial distribution in groundwater were performed using ArcGIS.

Model performance evaluation metrics

Three evaluation metrics were utilized to compare the predictive performance of different machine learning models: mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²). MAE and RMSE reflect the average absolute difference and the average distance between the nitrate predictions and observations, respectively, as presented in Eqs. (2) and (3). R² indicates the proportion of variance in the target variable that can be explained by the predictor variables, calculated as Eq. (4). Moreover, the mean R² of tenfold CV was used to evaluate model generalization.
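
The three metrics defined in Eqs. (2) to (4) can be written directly in standard-library Python. The sketch below is illustrative only: the function names and the sample values are assumptions, and the inputs are taken to be on the log10 scale used for modeling.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, Eq. (2)."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, Eq. (3)."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination, Eq. (4)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical log10-transformed observations and predictions.
observed = [0.0, 0.5, 1.0, 1.5]
predicted = [0.1, 0.4, 1.1, 1.4]

scores = {
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "R2": r2(observed, predicted),
}
```

MAE and RMSE coincide in this toy case because every error has the same magnitude; RMSE exceeds MAE as soon as the errors differ in size, since squaring weights large errors more heavily.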


$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|}{n} \tag{2}$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}{n}} \tag{3}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \tag{4}$$

where y_i is the ith observed value; ŷ_i is the ith predicted value; ȳ is the mean value of the observed values; n is the number of samples.

Model interpretability

SHAP is a recently developed unified measure of feature importance, which can help to improve the understanding of the predictions made by ML models (Lundberg & Lee, 2017). It is based on game theory and uses an additive feature attribution method where the model output is a linear combination of input variables. The SHAP value represents the marginal contribution of each feature to each prediction (Lundberg et al., 2020). Compared to previous feature importance methods, SHAP provides richer explanations that interpret models locally and globally, and the global explanations are built according to local explanations, ensuring consistency. It can also identify whether the contribution of each input feature is positive or negative based on SHAP values.

The SHAP method was applied in this study to analyze the local and global feature importance to understand the importance and influence of driving factors on groundwater nitrate spatial predictions, as well as model contributions from base models to the meta-model. The SHAP analysis was implemented using the Python package SHAP (Lundberg & Lee, 2017).

Results and discussion

Groundwater nitrate data summary

As shown in Fig. S2, for the whole dataset (n = 549), the annual average groundwater nitrate concentrations ranged from 0.098 to 52.06 mg/L-N from 2012 to 2021, with a mean concentration and a standard deviation of 6.31 mg/L-N and 6.70 mg/L-N, respectively. The 25th, 50th, and 75th percentile groundwater nitrate concentrations were 0.94, 4.41, and 9.87 mg/L-N, respectively. Overall, 20.79% of the samples exhibited high nitrate concentrations, exceeding the maximum admissible concentration (MAC) of nitrate in water for human consumption (11.3 mg/L-N), as set by the European Union (EU) in the Drinking Water Directive 80/778/EEC. These high nitrate concentrations were mainly located in St Bees Sandstones and Penrith Sandstones, the central part of the Eden Valley. The percentage of wells with groundwater nitrate below 2 mg/L-N was the largest (37.16%). These wells were concentrated in the limestone and north of the St Bees Sandstones, the catchment subregion throughout the Eden Valley.

The whole nitrate concentration data between 2012 and 2021 (n = 549) was divided into a training set (n = 472, 2012–2019) and a testing data set (n = 77, 2020–2021), as shown in Fig. S2. Training data ranged from 0.098 to 52.06 mg/L-N, and testing data ranged from 0.098 to 30.00 mg/L-N. Moreover, the first, second, and third quartiles of training data are 0.94, 4.44, and 9.63 mg/L-N, respectively, which are 1.00, 4.20, and 11.00 mg/L-N for testing data. In general, the distributions of the training and testing datasets were similar, which may help mitigate the tendency for the method to overfit the training data.

Hyperparameter tuning and model performance

The optimal hyperparameters of ML models were determined using the TPE optimization algorithm combined with the maximum tenfold CV mean R² criterion by training 1000 times (Table S2). Model performance was compared according to the evaluation metrics for the testing data: MAE, RMSE, and R² (Table 1). All individual and SEL models produced satisfying predictions and were considered acceptable. Based on the testing R², the model performance ranked in the following order: SEL > GBDT > XGB > RF > ET > KNN. Compared to the five individual models, the SEL model had the lowest MAE (0.1229) and RMSE (0.2586) and the highest R² (0.8644) for testing data, which indicated that the SEL model outperformed the other five individual models in predictive performance. Furthermore, in terms of generalization ability, models ranked the same as the
model performance based on the mean R² of tenfold CV. The SEL model had the highest CV mean R² of 0.8500, which was 2.68–4.90% higher than the other models, and the smallest CV standard deviation of 0.0702, suggesting better generalization and stability. Thus, in contrast with the five individual models, the two-level heterogeneous SEL model enhanced predictive performance and generalization ability.

Table 1 Model performance metrics for the models: gradient boosting decision tree (GBDT), extreme gradient boosting (XGB), random forest (RF), extremely randomized trees (ET), k-nearest neighbors (KNN), and stacking ensemble learning (SEL)

Model  Tenfold CV R²      Training data (n = 472)    Testing data (n = 77)
       (mean ± std.)      MAE     RMSE    R²         MAE     RMSE    R²
GBDT   0.8416 ± 0.0971    0.0999  0.2072  0.9000     0.1254  0.2618  0.8610
XGB    0.8400 ± 0.1082    0.0997  0.2056  0.9016     0.1263  0.2651  0.8575
RF     0.8368 ± 0.0910    0.1023  0.2114  0.8960     0.1271  0.2680  0.8544
ET     0.8363 ± 0.0954    0.1060  0.2140  0.8934     0.1315  0.2763  0.8452
KNN    0.8240 ± 0.1392    0.0958  0.2078  0.8994     0.1283  0.2859  0.8342
SEL    0.8500 ± 0.0702    0.1037  0.2112  0.8961     0.1229  0.2586  0.8644

The units of MAE and RMSE are log10 (mg/L-N), and std. represents standard deviation. Bold text indicates the best performance according to the evaluation metric

The box plots of predicted and observed groundwater nitrate concentrations were displayed in Fig. 5, visually representing the spread of nitrate values. To contrast the predicted and observed nitrate concentrations, we retransformed the predicted values back to nitrate concentrations. In Fig. 5, the lower and upper ends of the box denote the 25th and 75th percentiles (Q1 and Q3), the horizontal line inside the box represents the 50th percentile (the median), and the cross indicates the mean value. Moreover, the lower whisker represents the minimum nitrate value, and the upper whisker denotes the value of Q3 + 1.5(Q3 − Q1), excluding the outliers that are drawn as points.

Fig. 5 The box plots of observed (OBS) and predicted groundwater nitrate concentrations from the models: gradient boosting decision tree (GBDT), extreme gradient boosting (XGB), random forest (RF), extremely randomized trees (ET), k-nearest neighbors (KNN), and stacking ensemble learning (SEL)

In Fig. 5, it can be observed that the minimum (0.10 mg/L-N) and the first quartile (0.96–1.11 mg/L-N) of nitrate predictions from all models were similar to those of the observation (0.10 mg/L-N, 0.94 mg/L-N). Whereas the third quartile (8.91–9.35 mg/L-N) and the upper whisker (20.03–20.15 mg/L-N) from the five individual models were apparently lower than those of the SEL model (10.24 mg/L-N, 21.22 mg/L-N) and observation (9.94 mg/L-N, 23.35 mg/L-N), indicating that the predictions for five individual models were biased in high values. By contrast, the SEL model had a more reliable range of groundwater nitrate predictions, closer to the observations than the other five individual models. Moreover, the mean value of nitrate predictions from the SEL model (5.60 mg/L-N) was comparable to the observation (5.65 mg/L-N), which is marked by a cross in Fig. 5. In comparison, the mean values of the predictions from the individual models were 5.33–5.42 mg/L-N, suggesting that their predicted results were generally lower than the observed values. Furthermore, the standard deviation of predictions from the SEL model (5.34 mg/L-N) was also quite close to the observations (5.40 mg/L-N), revealing that its predictions were dispersed similarly to the observation. Overall,
the distribution of nitrate predictions from the SEL model was comparable to that of the observations at the training and testing phases in terms of the range, mean value, and standard deviation.

From the analysis above, it can be concluded that the SEL model exhibited superior predictive performance and generalization, indicating that its nitrate predictions were more reliable. Although GBDT and XGB performed relatively well, their high nitrate predictions were clearly lower than those of the SEL model and the observations. This is probably because ensemble tree regression models typically reduce the variance of predictions but leave bias, resulting in negative and positive bias for large and small values, respectively (Belitz & Stackelberg, 2021; Zhang & Lu, 2012). Thus, the SEL model can be a powerful tool for accurately predicting groundwater nitrate concentrations at unsampled locations.

Nitrate predictions and spatial distribution

After the training and testing phases, the SEL model was applied to predict groundwater nitrate concentrations across the 200 m × 200 m grid map covering the Eden Valley using environmental variables. Table 2 summarizes the percentages of different concentration ranges of groundwater nitrate spatial predictions for the SEL model. According to the statistical metrics, the predicted nitrate concentrations across the Eden Valley ranged from 0.11 to 27.27 mg/L-N, consistent with the observations excluding the outliers. The median and mean values for nitrate spatial predictions were 1.10 and 2.22 mg/L-N, respectively, indicating that nitrate concentrations are generally low at most locations in the study area. As shown in Table 2, the percentage of nitrate concentration classes decreased as the concentration increased. The predicted nitrate concentrations in the range of 0–2 mg/L-N accounted for the largest proportion at 67.36%, followed by the 2–5 mg/L-N (16.78%), 5–8 mg/L-N (10.85%), and 8–11.3 mg/L-N (4.22%) classes, respectively. By contrast, the areas with high groundwater nitrate concentrations exceeding the MAC of 11.3 mg/L-N occupied only 0.79% of the total, the lowest proportion within the study area, and these areas accounted for 2.46% of the sandstone aquifers.

Table 2  Percentages of different ranges of groundwater nitrate spatial predictions in the Eden Valley, utilizing the stacking ensemble learning (SEL) model

Nitrate (mg/L-N)   0–2     2–5     5–8     8–11.3   ≥ 11.3
Percentage (%)     67.36   16.78   10.85   4.22     0.79

Figure 6 shows the 200 m × 200 m spatial distribution grid map of predicted groundwater nitrate concentrations for the SEL model in the Eden Valley, representing the average annual nitrate level between 2012 and 2021. The results suggest that its distribution pattern is similar to the nitrate input reported in the previous study (Wang & Burke, 2017). Moreover, nearly 91.26% of high nitrate predictions exceeding 11.3 mg/L-N are located inside the NVZs, revealing that the predicted spatial distribution of groundwater nitrate for the SEL model is reliable. As illustrated in Fig. 6, predicted groundwater nitrate concentrations in most of the central part of the valley were generally above 2 mg/L-N, whereas concentrations in other aquifers were predominantly below 2 mg/L-N. Furthermore, the high nitrate concentrations exceeding 11.3 mg/L-N were concentrated in the Penrith Sandstone aquifer, where arable land and grassland predominated. It is evident that the groundwater nitrate contamination is primarily attributed to agriculture in the study area, which is in line with earlier investigations (Allen et al., 1997; Butcher et al., 2003). Therefore, it is necessary to control the application of N-fertilizers and animal manure to reduce nitrogen pollution sources in high groundwater nitrate areas and surrounding regions, as required by the NVZ regulations (EU, 1991). In addition, drip irrigation is suggested as a substitute for flood irrigation to limit nitrogen leaching from the bottom of the soil.

According to the nitrate spatial predictions from the SEL model, it is worth noting that about 8.74% of the high groundwater nitrate areas are located outside the designated NVZs. These areas are concentrated in the southeast and northeast of the Penrith NVZ, as well as the southeast of the Kirby Thore NVZ, and have the potential to exacerbate groundwater nitrate contamination without any mitigative measures. Based on the previous study (Wang & Burke, 2017), they are areas with high to moderately high nitrogen input. Thus, it is necessary to consider delineating these areas into the NVZs in the future and to formulate targeted management strategies. Moreover, a small portion of built-up areas in the central part of the valley are quite close to high nitrate locations. Hence, water managers should be cautious about potential health issues when directly using local groundwater.
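The class percentages in Table 2 and the share of exceedance cells inside the NVZs are simple summaries of the gridded predictions. The following is a minimal sketch of how such summaries could be computed, not the authors' code; the function names, the sample values, and the NVZ flags are hypothetical and only illustrate the arithmetic.

```python
from bisect import bisect_right

# Class boundaries (mg/L-N) as in Table 2; 11.3 mg/L-N is the MAC.
EDGES = [2.0, 5.0, 8.0, 11.3]
LABELS = ["0-2", "2-5", "5-8", "8-11.3", ">=11.3"]


def class_percentages(values):
    """Return the percentage of values falling in each concentration class."""
    counts = [0] * len(LABELS)
    for v in values:
        # bisect_right puts a value equal to an edge into the upper class,
        # so exactly 11.3 mg/L-N counts as ">=11.3".
        counts[bisect_right(EDGES, v)] += 1
    total = len(values)
    return {label: 100.0 * n / total for label, n in zip(LABELS, counts)}


def exceedance_share_inside(values, inside_zone, threshold=11.3):
    """Percentage of cells at or above `threshold` that lie inside a zone."""
    flags = [inside for v, inside in zip(values, inside_zone) if v >= threshold]
    return 100.0 * sum(flags) / len(flags) if flags else float("nan")


# Hypothetical grid-cell predictions (mg/L-N) and NVZ membership flags.
predicted = [0.4, 1.1, 1.8, 3.2, 6.5, 9.0, 12.1, 15.7, 0.9, 2.0]
in_nvz = [False, False, True, True, True, True, True, False, False, True]

print(class_percentages(predicted))
print(exceedance_share_inside(predicted, in_nvz))
```

Applied to the full 200 m × 200 m grid, the first function would give a breakdown of the kind shown in Table 2, and the second would give a figure comparable to the reported share of exceedance cells inside the NVZs.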

Fig. 6  Spatial distribution of predicted nitrate concentrations in groundwater for the SEL model at 200 m × 200 m resolution in the Eden Valley
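The predictions mapped in Fig. 6 come from a two-level stack in which base-model outputs feed a meta-model. The sketch below illustrates that general idea under stated assumptions and is not the authors' implementation: a straight-line fit and a k-nearest-neighbour regressor stand in for the five base models, the meta-model is assumed to be a linear combination without intercept, and training the meta-model on out-of-fold predictions is a common stacking practice assumed here. All function names and data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares y = a + b*x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regressor on a single feature."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def fit_meta(p1, p2, ys):
    """Least-squares weights (w1, w2) for y ~ w1*p1 + w2*p2 (no intercept)."""
    a11 = sum(u * u for u in p1)
    a12 = sum(u * v for u, v in zip(p1, p2))
    a22 = sum(v * v for v in p2)
    b1 = sum(u * y for u, y in zip(p1, ys))
    b2 = sum(v * y for v, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def fit_stack(xs, ys, n_folds=5):
    """Level 0: base models; level 1: meta-model on out-of-fold predictions."""
    base_fitters = [fit_line, fit_knn]
    n = len(xs)
    oof = [[0.0] * n for _ in base_fitters]
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        tx = [xs[i] for i in train_idx]
        ty = [ys[i] for i in train_idx]
        for m, fitter in enumerate(base_fitters):
            model = fitter(tx, ty)
            for i in range(fold, n, n_folds):  # held-out samples of this fold
                oof[m][i] = model(xs[i])
    w1, w2 = fit_meta(oof[0], oof[1], ys)
    full = [fitter(xs, ys) for fitter in base_fitters]  # refit on all data
    return lambda x: w1 * full[0](x) + w2 * full[1](x)


# Hypothetical one-feature data set with a nonlinear response plus noise.
random.seed(0)
xs = [0.5 * i for i in range(40)]
ys = [2.0 * x ** 0.5 + random.gauss(0.0, 0.2) for x in xs]
stack = fit_stack(xs, ys)
print(stack(7.3))
```

Using out-of-fold predictions keeps the meta-model from being fitted to base-model outputs for samples the base models have already seen, which is one reason a stack can generalize better than its components.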

Quantitative analysis of driving factors and base models

Contributions of driving factors to nitrate predictions

The importance and influence of the driving factors underlying the nitrate predictions on the training data were quantitatively analyzed using the SHAP method, offering valuable insights into the relationship between environmental variables and groundwater nitrate concentrations. Figure 7a illustrates the global variable importance ranking based on the mean absolute value of SHAP values shown on the x-axis, denoting the average impact on model output magnitude. Figure 7b presents the SHAP summary plot as a violin plot, illustrating the global distribution of feature influence. The y-axis lists the top ten most important variables, and the
x-axis represents the SHAP value of each instance for the feature. Moreover, the width of the violin plot denotes the frequency of the SHAP value, and the color indicates the average feature value at that position, with red and blue signifying high and low relative values of the variables, respectively. Figure 7c displays the local SHAP values for each value of the top ten crucial driving factors and shows the relationship between the environmental variables (x-axis) and SHAP values (y-axis), providing insights into how nitrate predictions vary with increasing values of the variables.

As shown in Fig. 7a, the top ten crucial variables for the SEL model can be generally categorized into the following five categories: hydrogeology, hydrology, land use, soil organic matter, and topography. Transmissivity (T) and K are essential hydrogeological parameters representing the ability of an aquifer to transmit and conduct water, both of which are related to groundwater flow rate (Wang et al., 2013). They are the most and the third-most important driving factors for groundwater nitrate predictions in the SEL model, respectively, and both have a positive impact (Fig. 7b and c), consistent with the findings of earlier research (Wang & Burke, 2017). High T and K can accelerate groundwater flow, thereby facilitating the migration and dispersion of nitrate (Jang et al., 2017). Moreover, rapid groundwater flow can reduce the potential for nitrate to interact with microorganisms and other substances, hindering denitrification

Fig. 7  SHAP analysis for training data. a The average absolute value of SHAP values, b SHAP values, and c SHAP dependence
plots of the top ten essential variables for the stacking ensemble learning (SEL) model
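The global ranking in Fig. 7a is the mean absolute SHAP value of each variable over all samples. The sketch below illustrates that quantity with exact Shapley values computed by brute-force enumeration of feature coalitions. It is not the SHAP library and not the authors' code; the toy model, the background sample, and the variable labels are hypothetical, and enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values of model(x) relative to a background sample.

    Features outside a coalition take their values from each background row,
    and the model output is averaged over those rows.
    """
    n = len(x)

    def value(coalition):
        total = 0.0
        for row in background:
            z = [x[j] if j in coalition else row[j] for j in range(n)]
            total += model(z)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mean_abs_shap(model, rows, background):
    """Global importance: mean |Shapley value| per feature over all rows."""
    sums = [0.0] * len(rows[0])
    for x in rows:
        for j, p in enumerate(shapley_values(model, x, background)):
            sums[j] += abs(p)
    return [s / len(rows) for s in sums]


# Hypothetical three-variable model; the labels are for illustration only.
names = ["T", "BFI", "K"]


def toy_model(z):
    return 2.0 * z[0] - 1.0 * z[1] + 0.0 * z[2]


background = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
rows = [[3.0, 1.0, 5.0], [1.0, 4.0, 0.0]]

importance = mean_abs_shap(toy_model, rows, background)
for name, score in sorted(zip(names, importance), key=lambda p: -p[1]):
    print(name, round(score, 3))
```

For each sample, the Shapley values sum to the difference between the model output for that sample and the average output over the background rows, which is the additivity property that SHAP-based rankings rely on.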

processes and preventing effective nitrate removal (Rivett et al., 2008). This ultimately increases the risk of nitrate pollution in groundwater (Aller et al., 1987).

BFI is a critical index that reflects the contribution of groundwater to river flow. It emerged as the second most important variable and exhibited a positive correlation with nitrate predictions. This can be attributed to the fact that BFI is positively correlated with groundwater recharge (r = 0.69) (Zomlot et al., 2015). A higher BFI signifies greater recharge, which can enhance the transport of nitrogen from the surface to the aquifer and facilitate nitrate leaching into groundwater. It can potentially raise groundwater nitrate levels (Nolan & Hitt, 2006), particularly in areas with high agricultural nitrogen loading (Böhlke, 2002). Although increased recharge can contribute to the dilution of groundwater nitrate, in agriculturally intensive areas such as the Eden Valley, this effect is likely less significant than the substantial nitrate leaching into the groundwater. In contrast, potential evapotranspiration (PET) showed a negative correlation with recharge (r = −0.35) (Walker et al., 2019). Therefore, increased PET suggests reduced recharge, which may limit contaminants leaching into the aquifer, resulting in lower nitrate levels in groundwater.

Furthermore, the percentages of arable land and woodland within a 500 m radius circular buffer ranked fourth and sixth in the SEL model, respectively, and were associated with high nitrate concentrations, as shown in Fig. 7c. The possible reason is that the arable land percentage and fertilizer application rate are highly correlated (r = 0.72) (Fig. 2), in line with previous findings (Butcher et al., 2003; Ransom et al., 2022). Extensive fertilizer and manure utilization in arable land can enhance crop growth and promote nitrification (Zhang et al., 2013). Thus, excess nitrogen not absorbed by crops likely leads to an elevated nitrate level. Furthermore, the positive influence of woodland on elevated groundwater nitrate levels is possibly due to abundant nitrogen from various sources, such as atmospheric nitrogen deposition, litter decomposition, and biological nitrogen fixation (Sardar et al., 2023). Notably, atmospheric nitrogen deposition in most woodlands in the UK surpasses the critical loads (Vanguelova et al., 2024), enhancing nitrogen mineralization and nitrification in the soil (Zhu et al., 2015), thereby raising the likelihood of nitrate leaching into groundwater (Dise & Wright, 1995).

Conversely, groundwater nitrate concentrations tended to decrease with increasing C:N ratio and organic carbon content in the soil, which ranked fifth and seventh in importance. Elevated C:N ratios and soil organic carbon can restrict the availability of nitrogen sources essential for microbial metabolism (Hoang et al., 2022). It has been reported that a high C:N ratio in soil adversely impacts ammonifying bacteria, which facilitate the conversion of soil organic nitrogen into ammonium nitrogen (Yang et al., 2023). The nitrification process is closely related to the ammonium nitrogen production rate (Booth et al., 2005), and thus, insufficient nitrogen can significantly hamper the nitrification process. In addition, an abundance of organic carbon in soil can strengthen the activity of denitrifying bacteria, which are mostly facultative anaerobic heterotrophs, favoring denitrification and reducing nitrate levels (Sheng et al., 2018). Consequently, a high C:N ratio and increased organic carbon content can help prevent nitrate accumulation in soil and reduce nitrate leaching losses (Bai et al., 2021), thereby decreasing the risk of nitrate pollution in groundwater.

Moreover, elevation was ranked as the eighth most significant influencing factor. As shown in Fig. 7c, the SHAP value increased with elevation, peaking at around 130 m before gradually decreasing. Specifically, 86.4% of the samples with positive SHAP values fall within the elevation range of 60–150 m, where the positive influence on high nitrate predictions is stronger than the negative, as illustrated in Fig. S3a. These elevations are predominantly located along the River Eden (Fig. S2), which is suitable for farming. Fig. S3b reveals that when the percentage of arable land exceeds 5%, 72.8% of the samples are situated at an elevation ranging from 60 to 150 m, a significantly larger proportion of samples than in other elevation intervals. Therefore, prevalent agricultural practices on arable land at these elevations, including the application of chemical fertilizers and manure, likely contribute to the elevated groundwater nitrate level.

In addition, it should be noted that a thicker unsaturated zone is associated with higher groundwater nitrate concentrations (Böhlke, 2002). This is probably because of the longer lag time for peak nitrate leaching in the 1980s in areas with a thick unsaturated
zone, which had already arrived at the groundwater table in the 1990s in regions with a thinner unsaturated zone (Wang et al., 2013). Furthermore, due to limited data access, this study used long-term average values for the unsaturated zone thickness. If data on the temporal dynamics of the unsaturated zone thickness or groundwater table become available, further research could explore their impacts on nitrate concentrations in groundwater.

Contributions of base models to the meta-model

In the stacking model, the outputs from the base models were used as the input for the meta-model. To assess the contribution of the base models to the meta-model, the importance of each base model was analyzed by employing SHAP. Based on the mean absolute value of SHAP values, the five base models at level 0 exhibited positive impacts on the meta-model at level 1, with the following ranking: XGB > KNN > GBDT > RF > ET.

In the SEL model, the average absolute SHAP values of the outputs from both XGB and KNN were nearly 0.12, higher than those of the other base models. This is likely because the importance rankings of the percentage of woodland in a 500 m radius circular buffer in XGB and KNN are higher (third) than in the other base models (fifth or sixth), as shown in Fig. S4b and e. Conversely, the average absolute SHAP value of the output from the ET model was below 0.10, which was clearly lower than those of the other base models. This may be associated with the percentage of arable land, which ranked tenth in the ET model (Fig. S4d) but in the top five in the other four base models and in the SEL model.

Furthermore, in the top three performing models (i.e., GBDT, XGB, and RF), T was identified as the most influential variable (Fig. S4a–c). Another variable related to aquifer characteristics, K, ranked in the top five in four of the base models, excluding GBDT.

In conclusion, the contribution analysis of driving factors to the final nitrate predictions, as well as the impacts of the base models, suggests that the effects of hydrogeology, hydrology, land use, soil organic matter, and elevation in this study are consistent with previous findings (Aller et al., 1987; Butcher et al., 2003). The results reveal that hydrogeological conditions (T and K) and land use (particularly arable land and woodland) play a crucial role in predicting groundwater nitrate concentrations in the Eden Valley. Consequently, from the perspective of genesis analysis, nitrate spatial predictions from the SEL model are reliable. It is essential for water environment managers to formulate targeted strategies to manage fertilizer application and manure storage, especially in areas with high nitrogen loading and fast groundwater flow.

Conclusions

Nitrate is a widespread pollutant in groundwater, threatening human health and environmental safety worldwide. This study developed a novel framework for identifying the spatial pattern of groundwater nitrate concentration with high accuracy and quantitatively analyzing the importance of key driving factors. The results demonstrate that the proposed ISEL framework is effective in the Eden Valley. The SEL model improved predictive performance and generalization ability compared to the five individual ML models (GBDT, XGB, RF, ET, KNN), providing reliable nitrate predictions. It was found that groundwater nitrate concentrations in 2.46% of sandstone aquifers exceed the MAC of 11.3 mg/L-N, while 8.74% of areas with high nitrate concentrations have not been delineated as NVZs. SHAP analysis further reveals that groundwater nitrate levels are significantly affected by aquifer characteristics and land use, with T identified as the most important factor in the SEL model. These findings can assist water environmental managers in developing targeted pollution control strategies to ensure sustainable groundwater quality management. This study marks the first integration of the stacking technique with an interpretability approach in the field of groundwater contaminants. Future research directions include predicting contaminant distribution across different spatial scales, modeling the spatiotemporal dynamics of pollutants, and incorporating broader data sources such as remote sensing. Overall, the proposed framework offers a promising way to accurately predict contaminant distribution and clarify complex environmental phenomena, thereby contributing to sustainable development.

Acknowledgements The authors publish with the permission of the Executive Director of the British Geological Survey (UKRI/NERC). This research was supported by the British Geological Survey via NERC national capability and the NSFC. We would like to thank anonymous reviewers for their valuable comments.

Author contributions XL contributed to conceptualization, methodology, software, and writing – original draft; GHL was involved in conceptualization, formal analysis, and funding acquisition; LW was responsible for methodology, data curation, writing – review & editing, funding acquisition, and supervision. YSY participated in writing – review & editing, supervision, and funding acquisition; YYL helped with data curation and validation. ZGL was involved in software and visualization; BH participated in software and visualization; GLW was responsible for resources and supervision.

Funding This research was funded by the British Geological Survey via NERC national capability and the NSFC grants (Nos. 42277189, 51779030).

Data availability No datasets were generated or analysed during the current study.

Declarations

Conflict of interest The authors declare no competing interests.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Allen, D. J., Brewerton, L. J., Coleby, L. M., Gibbs, B. R., Lewis, M. A., MacDonald, A. M., Wagstaff, S. J., & Williams, A. T. (1997). The physical properties of major aquifers in England and Wales (Report No. WD/97/34). British Geological Survey. https://nora.nerc.ac.uk/id/eprint/13137/1/WD97034.pdf
Allen, D. J., Newell, A. J., & Butcher, A. S. (2010). Preliminary review of the geology and hydrogeology of the Eden DTC sub-catchments (Report No. OR/10/063). British Geological Survey. https://nora.nerc.ac.uk/id/eprint/12788/1/OR10063.pdf
Aller, L., Bennett, T., Lehr, J. H., Petty, R. J., & Hackett, G. (1987). DRASTIC: A standardized system for evaluating ground water pollution potential using hydrogeologic settings (Report No. EPA600287035). US Environmental Protection Agency. https://cfpub.epa.gov/si/ntislink.cfm?dirEntryID=35474
Bai, X., Jiang, Y., Miao, H., Xue, S., Chen, Z., & Zhou, J. (2021). Intensive vegetable production results in high nitrate accumulation in deep soil profiles in China. Environmental Pollution, 287, 117598. https://doi.org/10.1016/j.envpol.2021.117598
Ballabio, C., Lugato, E., Fernández-Ugalde, O., Orgiazzi, A., Jones, A., Borrelli, P., Montanarella, L., & Panagos, P. (2019). Mapping LUCAS topsoil chemical properties at European scale using Gaussian process regression. Geoderma, 355, 113912. https://doi.org/10.1016/j.geoderma.2019.113912
Ballabio, C., Panagos, P., & Montanarella, L. (2016). Mapping topsoil physical properties at European scale using the LUCAS database. Geoderma, 261, 110–123. https://doi.org/10.1016/j.geoderma.2015.07.006
Barzegar, R., Razzagh, S., Quilty, J., Adamowski, J., Kheyrollah Pour, H., & Booij, M. J. (2021). Improving GALDIT-based groundwater vulnerability predictive mapping using coupled resampling algorithms and machine learning models. Journal of Hydrology, 598, 126370. https://doi.org/10.1016/j.jhydrol.2021.126370
Belitz, K., & Stackelberg, P. E. (2021). Evaluation of six methods for correcting bias in estimates from ensemble tree machine learning regression models. Environmental Modelling & Software, 139, 105006. https://doi.org/10.1016/j.envsoft.2021.105006
Bergstra, J., Bardenet, R. E. M., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In 24th International Conference on Neural Information Processing Systems (NIPS 2011), Red Hook, NY, USA. https://doi.org/10.5555/2986459.2986743
Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., & Cox, D. D. (2015). Hyperopt: A Python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1), 14008. https://doi.org/10.1088/1749-4699/8/1/014008
BGS. (2020). BGS geology 50k (DigMapGB-50). British Geological Survey. https://www.bgs.ac.uk/datasets/bgs-geology-50k-digmapgb/
Böhlke, J. (2002). Groundwater recharge and agricultural contamination. Hydrogeology Journal, 10(1), 153–179. https://doi.org/10.1007/s10040-001-0183-3
Boorman, D. B., Hollis, J. M., & Lilly, A. (1995). Hydrology of soil types: a hydrologically based classification of the soils of the United Kingdom (Report No. 126). Institute of Hydrology. https://nora.nerc.ac.uk/id/eprint/7369/1/IH_126.pdf
Booth, M. S., Stark, J. M., & Rastetter, E. (2005). Controls on nitrogen cycling in terrestrial ecosystems: A synthetic analysis of literature data. Ecological Monographs, 75(2), 139–157. https://doi.org/10.1890/04-0988
Butcher, A. S., Lawrence, A. R., Jackson, C., Cunningham, J., Cullis, E., Hasan, K., & Ingram, J. (2003). Investigation of rising nitrate concentrations in groundwater in the Eden Valley, Cumbria: Phase 1 project scoping study (Report No. NC/00/24/14). UK Environment Agency. https://aquadocs.org/handle/1834/27237
Castaldo, G., Visser, A., Fogg, G. E., & Harter, T. (2021). Effect of groundwater age and recharge source on nitrate concentrations in domestic wells in the San Joaquin Valley. Environmental Science & Technology, 55(4), 2265–2275. https://doi.org/10.1021/acs.est.0c03071
Chen, T., & Guestrin, C. (2016). XGBoost: a scalable tree boost system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Los Angeles. https://doi.org/10.1145/2939672.2939785
Dise, N. B., & Wright, R. F. (1995). Nitrogen leaching from European forests in relation to nitrogen deposition. Forest Ecology and Management, 71(1), 153–161. https://doi.org/10.1016/0378-1127(94)06092-W
EA. (2012). Open water quality archive datasets. 2022-9-3, from https://environment.data.gov.uk/water-quality/view/download/new
EA. (2021). Nitrates: Challenges for the water environment. 2023-2-25, from https://www.gov.uk/government/publications/nitrates-challenges-for-the-water-environment
EU. (1991). Council directive concerning the protection of waters against pollution caused by nitrates from agricultural sources (91/676/EEC) (Report No. Official Journal L375). Council of the European Communities. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:31991L0676&from=EN
Gu, J., Liu, S., Zhou, Z., Chalov, S. R., & Qi, Z. (2022). A stacking ensemble learning model for monthly rainfall prediction in the Taihu Basin China. Water, 14(3), 492. https://doi.org/10.3390/w14030492
Hoang, H. G., Thuy, B. T. P., Lin, C., Vo, D. N., Tran, H. T., Bahari, M. B., Le, V. G., & Vu, C. T. (2022). The nitrogen cycle and mitigation strategies for nitrogen loss during organic waste composting: A review. Chemosphere, 300, 134514. https://doi.org/10.1016/j.chemosphere.2022.134514
IAHS. (2023). Groundwater – more about the hidden resource. 2023/06/01, from https://iah.org/education/general-public/groundwater-hidden-resource
Iqbal, J., Su, C., Ahmad, M., Baloch, M. Y. J., Rashid, A., Ullah, Z., Abbas, H., Nigar, A., Ali, A., & Ullah, A. (2023). Hydrogeochemistry and prediction of arsenic contamination in groundwater of Vehari, Pakistan: Comparison of artificial neural network, random forest and logistic regression models. Environmental Geochemistry and Health, 46(1), 14. https://doi.org/10.1007/s10653-023-01782-7
Jang, E., He, W., Savoy, H., Dietrich, P., Kolditz, O., Rubin, Y., Schüth, C., & Kalbacher, T. (2017). Identifying the influential aquifer heterogeneity factor on nitrate reduction processes by numerical simulation. Advances in Water Resources, 99, 38–52. https://doi.org/10.1016/j.advwatres.2016.11.007
Joseph, V. R. (2022). Optimal ratio for data splitting. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(4), 531–538. https://doi.org/10.1002/sam.11583
Kaur, L., Rishi, M. S., & Siddiqui, A. U. (2020). Deterministic and probabilistic health risk assessment techniques to evaluate non-carcinogenic human health risk (NHHR) due to fluoride and nitrate in groundwater of Panipat, Haryana India. Environmental Pollution, 259, 113711. https://doi.org/10.1016/j.envpol.2019.113711
Knoll, L., Breuer, L., & Bach, M. (2019). Large scale prediction of groundwater nitrate concentrations from spatial data using machine learning. Science of the Total Environment, 668, 1317–1327. https://doi.org/10.1016/j.scitotenv.2019.03.045
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. New York: Springer. https://doi.org/10.1007/978-1-4614-6849-3
Li, L., Qiao, J., Yu, G., Wang, L., Li, H., Liao, C., & Zhu, Z. (2022). Interpretable tree-based ensemble model for predicting beach water quality. Water Research, 211, 118078. https://doi.org/10.1016/j.watres.2022.118078
Liu, S., Zheng, T., Li, Y., & Zheng, X. (2023). A critical review of the central role of microbial regulation in the nitrogen biogeochemical process: New insights for controlling groundwater nitrogen contamination. Journal of Environmental Management, 328, 116959. https://doi.org/10.1016/j.jenvman.2022.116959
Lu, M., Hou, Q., Qin, S., Zhou, L., Hua, D., Wang, X., & Cheng, L. (2023). A stacking ensemble model of various machine learning models for daily runoff forecasting. Water, 15(7), 1265. https://doi.org/10.3390/w15071265
Lundberg, S. M., & Lee, S. (2017). A unified approach to interpreting model predictions. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. https://doi.org/10.5555/3295222.3295230
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56–67. https://doi.org/10.1038/s42256-019-0138-9
Mahlknecht, J., Torres-Martínez, J. A., Kumar, M., Mora, A., Kaown, D., & Loge, F. J. (2023). Nitrate prediction in groundwater of data scarce regions: The futuristic freshwater management outlook. Science of the Total Environment, 905, 166863. https://doi.org/10.1016/j.scitotenv.2023.166863
Mainali, J., Chang, H., & Chun, Y. (2019). A review of spatial statistical approaches to modeling water quality. Progress in Physical Geography: Earth and Environment, 43(6), 801–826. https://doi.org/10.1177/0309133319852003
Morshed-Bozorgdel, A., Kadkhodazadeh, M., Valikhan Anaraki, M., & Farzin, S. (2022). A novel framework based on the stacking ensemble machine learning (SEML) method: Application in wind speed modeling. Atmosphere. https://doi.org/10.3390/atmos13050758
Morton, R. D., Rowland, C. S., Wood, C. M., Meek, L., Marston, C. G., & Smith, G. M. (2014). Land cover map 2007 (25m raster, GB) v1.2. NERC Environmental Information Data Centre. https://doi.org/10.5285/a1f88807-4826-44bc-994d-a902da5119c2
Musacchio, A., Re, V., Mas-Pla, J., & Sacchi, E. (2020). EU Geochemistry and Health, 46(3), 80. https://​doi.​org/​10.​
Nitrates Directive, from theory to practice: Environmental 1007/​s10653-​023-​01845-9
effectiveness and influence of regional governance on its Sardar, M. F., Younas, F., Farooqi, Z. U. R., & Li, Y. (2023).
performance. Ambio, 49(2), 504–516. https://doi.org/10.1007/s13280-019-01197-8
Soil nitrogen dynamics in natural forest ecosystem: a review. Frontiers in Forests and Global Change. https://doi.org/10.3389/ffgc.2023.1144930
Nadiri, A. A., Bordbar, M., Nikoo, M. R., Silabi, L. S. S., Senapathi, V., & Xiao, Y. (2023). Assessing vulnerability of coastal aquifer to seawater intrusion using Convolutional Neural Network. Marine Pollution Bulletin, 197, 115669. https://doi.org/10.1016/j.marpolbul.2023.115669
Nearing, G. S., Kratzert, F., Sampson, A. K., Pelissier, C. S., Klotz, D., Frame, J. M., Prieto, C., & Gupta, H. V. (2021). What role does hydrological science play in the age of machine learning? Water Resources Research, 57(3), e2020WR028091. https://doi.org/10.1029/2020WR028091
Nolan, B. T., Gronberg, J. M., Faunt, C. C., Eberts, S. M., & Belitz, K. (2014). Modeling nitrate at domestic and public-supply well depths in the Central Valley, California. Environmental Science & Technology, 48(10), 5643–5651. https://doi.org/10.1021/es405452q
Met Office, Hollis, D., McCarthy, M., Kendon, M., Legg, T., & Simpson, I. (2018). HadUK-Grid gridded and regional average climate observations for the UK. Centre for Environmental Data Analysis. 2023-08-04. http://catalogue.ceda.ac.uk/uuid/4dc8450d889a491ebb20e724debe2dfb
Pedregosa, F., Varoquaux, G. E. L., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. D. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830. https://doi.org/10.5555/1953048.2078195
Picetti, R., Deeney, M., Pastorino, S., Miller, M. R., Shah, A., Leon, D. A., Dangour, A. D., & Green, R. (2022). Nitrate and nitrite contamination in drinking water and cancer risk: A systematic review with meta-analysis. Environmental Research, 210, 112988. https://doi.org/10.1016/j.envres.2022.112988
Ransom, K. M., Nolan, B. T., Stackelberg, P. E., Belitz, K., & Fram, M. S. (2022). Machine learning predictions of nitrate in groundwater used for drinking supply in the conterminous United States. Science of the Total Environment, 807, 151065. https://doi.org/10.1016/j.scitotenv.2021.151065
Richards, J., Chambers, T., Hales, S., Joy, M., Radu, T., Woodward, A., Humphrey, A., Randal, E., & Baker, M. G. (2022). Nitrate contamination in drinking water and colorectal cancer: Exposure assessment and estimated health burden in New Zealand. Environmental Research, 204, 112322. https://doi.org/10.1016/j.envres.2021.112322
Rivett, M. O., Buss, S. R., Morgan, P., Smith, J. W. N., & Bemment, C. D. (2008). Nitrate attenuation in groundwater: A review of biogeochemical controlling processes. Water Research, 42(16), 4215–4232. https://doi.org/10.1016/j.watres.2008.07.020
Sakizadeh, M., Zhang, C., & Milewski, A. (2024). Spatial distribution pattern and health risk of groundwater contamination by cadmium, manganese, lead and nitrate in groundwater of an arid area. Environmental
Shams, R., Alimohammadi, S., & Yazdi, J. (2021). Optimized stacking, a new method for constructing ensemble surrogate models applied to DNAPL-contaminated aquifer remediation. Journal of Contaminant Hydrology, 243, 103914. https://doi.org/10.1016/j.jconhyd.2021.103914
Sheng, S., Liu, B., Hou, X., Liang, Z., Sun, X., Du, L., & Wang, D. (2018). Effects of different carbon sources and C/N ratios on the simultaneous anammox and denitrification process. International Biodeterioration & Biodegradation, 127, 26–34. https://doi.org/10.1016/j.ibiod.2017.11.002
Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. Scotts Valley, CA, US: CreateSpace Independent Publishing Platform. https://api.semanticscholar.org/CorpusID:61259041
Vanguelova, E., Pitman, R., & Benham, S. (2024). Responses of forest ecosystems to nitrogen deposition in the United Kingdom. In E. Du & W. D. Vries (Eds.), Atmospheric nitrogen deposition to global forests (pp. 183–203). Academic Press.
Walker, D., Parkin, G., Schmitter, P., Gowing, J., Tilahun, S. A., Haile, A. T., & Yimam, A. Y. (2019). Insights from a multi-method recharge estimation comparison study. Groundwater, 57(2), 245–258. https://doi.org/10.1111/gwat.12801
Wang, L., & Burke, S. P. (2017). A catchment-scale method to simulating the impact of historical nitrate loading from agricultural land on the nitrate-concentration trends in the sandstone aquifers in the Eden Valley, UK. Science of the Total Environment, 579, 133–148. https://doi.org/10.1016/j.scitotenv.2016.10.235
Wang, L., Butcher, A. S., Stuart, M. E., Gooddy, D. C., & Bloomfield, J. P. (2013). The nitrate time bomb: A numerical way to investigate nitrate storage and lag time in the unsaturated zone. Environmental Geochemistry and Health, 35(5), 667–681. https://doi.org/10.1007/s10653-013-9550-y
Wang, L., Stuart, M. E., Bloomfield, J. P., Butcher, A. S., Gooddy, D. C., McKenzie, A. A., Lewis, M. A., & Williams, A. T. (2012). Prediction of the arrival of peak nitrate concentrations at the water table at the regional scale in Great Britain. Hydrological Processes, 26(2), 226–239. https://doi.org/10.1002/hyp.8164
Wang, L., Zhu, Z., Sassoubre, L., Yu, G., Liao, C., Hu, Q., & Wang, Y. (2021). Improving the robustness of beach water quality modeling using an ensemble machine learning approach. Science of the Total Environment, 765, 142760. https://doi.org/10.1016/j.scitotenv.2020.142760
Wheeler, D. C., Nolan, B. T., Flory, A. R., DellaValle, C. T., & Ward, M. H. (2015). Modeling groundwater nitrate concentrations in private wells in Iowa. Science of the Total Environment, 536, 481–488. https://doi.org/10.1016/j.scitotenv.2015.07.080
WHO. (2022). Guidelines for Drinking-Water Quality: Fourth Edition Incorporating the First and Second Addenda


(fourth ed.). Geneva: World Health Organization. https://www.who.int/publications/i/item/9789240045064
Yang, X., Hu, Z., Xie, Z., Li, S., Sun, X., Ke, X., & Tao, M. (2023). Low soil C:N ratio results in accumulation and leaching of nitrite and nitrate in agricultural soils under heavy rainfall. Pedosphere, 33(6), 865–879. https://doi.org/10.1016/j.pedsph.2023.03.010
Zhang, G., & Lu, Y. (2012). Bias-corrected random forests in regression. Journal of Applied Statistics, 39(1), 151–160. https://doi.org/10.1080/02664763.2011.578621
Zhang, J., Zhu, T., Meng, T., Zhang, Y., Yang, J., Yang, W., Müller, C., & Cai, Z. (2013). Agricultural land use affects nitrate production and conservation in humid subtropical soils in China. Soil Biology and Biochemistry, 62, 107–114. https://doi.org/10.1016/j.soilbio.2013.03.006
Zhu, X., Zhang, W., Chen, H., & Mo, J. (2015). Impacts of nitrogen deposition on soil nitrogen cycle in forest ecosystems: A review. Acta Ecologica Sinica, 35(3), 35–43. https://doi.org/10.1016/j.chnaes.2015.04.004
Zomlot, Z., Verbeiren, B., Huysmans, M., & Batelaan, O. (2015). Spatial distribution of groundwater recharge and base flow: Assessment of controlling factors. Journal of Hydrology: Regional Studies, 4, 349–368. https://doi.org/10.1016/j.ejrh.2015.07.005

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
