Predicting Fecal-Indicator Organisms in Coastal Waters

Using a Complex Nonlinear Artificial Intelligence Model


Man-Yue Lam 1 and Reza Ahmadian 2
Downloaded from ascelibrary.org by "Indian Institute of Technology (BHU), Varanasi" on 01/08/25. Copyright ASCE. For personal use only; all rights reserved.

Abstract: High levels of fecal-indicator organisms (FIOs) at bathing water sites can cause disease and pose threats to public health. There is a need for predicting FIO levels to inform the public and reduce exposure. Data-driven models are one of the main tools being considered as predictive models. However, identifying the main inputs of the data-driven models is a major challenge in developing FIO predictor models. This paper develops a data-driven model for FIO concentration prediction based on a limited number of critical input variables. Essential variables were identified with a combination of the gamma test and genetic algorithm (Gamma-GA test). Artificial neural networks (ANNs) and linear regression models were developed from the variables identified by the Gamma-GA test and, for comparison, by stepwise multilinear regression. The models were applied to a case study, and it was found that the model using the Gamma-GA test has a high potential to predict FIO levels more accurately, although this requires further investigation with different case studies. A correlation analysis was required prior to the variable identification approaches in this study. The need for this analysis highlights the significance of understanding the waterbody and the data set in the development and application of data-driven models. Models using a Gamma-GA test were more capable of predicting extreme (high) FIO concentrations, making a Gamma-GA test more suitable for a bathing water quality early warning system. The importance of nonlinearity in such predictive models was also demonstrated by the better performance of nonlinear ANN models compared with linear regression models regardless of the variable identification approaches used. This paper highlights the importance of nonlinearity in bathing water quality prediction and encourages further utilization of nonlinear models for this application. DOI: 10.1061/JOEEDU.EEENG-6986. © 2022 American Society of Civil Engineers.

1 Research Associate, School of Engineering, Cardiff Univ., Cardiff CF24 3AA, UK (corresponding author). ORCID: https://orcid.org/0000-0001-7259-968X. Email: [email protected]
2 Professor, School of Engineering, Cardiff Univ., Cardiff CF24 3AA, UK. ORCID: https://orcid.org/0000-0003-2665-4734
Note. This manuscript was submitted on April 13, 2022; approved on October 5, 2022; published online on December 6, 2022. Discussion period open until May 6, 2023; separate discussions must be submitted for individual papers. This paper is part of the Journal of Environmental Engineering, © ASCE, ISSN 0733-9372.

Introduction

Waterborne pathogens in waterbodies cause illnesses such as gastrointestinal infections, eye infections, skin complaints, and nose and throat infections (Pruss 1998; Pandey et al. 2014). Fecal-indicator organisms (FIOs), e.g., E. coli and Enterococci, are commonly used to indicate the level of pathogens in waterbodies (Dufour 1984; Pandey et al. 2014). In Europe, the European Union (EU)'s revised Bathing Water Directive (rBWD) (European Commission 2006) requires member states to monitor at least the concentrations of two FIO species in designated bathing waters for compliance. The rBWD recognizes short-term occasional pollution and includes provisions for discounting compliance requirements when there is a predictive and warning system to alert the public to impending poor water quality.

Traditionally, FIO concentrations in bathing water samples are determined by culture-based methods. These methods require a laboratory assay of at least 18–24 h (USEPA 2010). However, FIO concentrations change continuously (Boehm et al. 2002; Whitman et al. 2004; Kim et al. 2004; King et al. 2021), causing culture-based warning systems to give outdated water quality alerts. More rapid FIO analysis methods such as quantitative polymerase chain reaction (qPCR) can determine FIO concentrations in under 6 h, but these methods require significant up-front investments and trained personnel (Zhang et al. 2018) and still cannot be used as a predictive tool. Although field sampling and analysis are important, they do not provide predictions of impending bathing water quality.

Two- and three-dimensional hydroenvironmental models are commonly applied to assess FIO concentrations in waterbodies. These models numerically solve the mass and momentum equations of fluids as well as the fate and transport of FIOs, including decay and interaction with sediment. These models have been applied in a wide range of studies (e.g., Lee and Qu 2004; Lin et al. 2008; Schippmann et al. 2013; Huang et al. 2017; Abu-Bakar et al. 2017) to provide relatively accurate predictions of the spatial and temporal concentration distributions of FIOs. Nevertheless, these models require detailed knowledge of flow and FIOs at the boundary of the modeling domain, which are generally very expensive and time-consuming to acquire. Moreover, such models are usually computationally demanding and require a long run time even on modern computers. Therefore, using such models in real time as a part of early warning systems is not practical.

Data-driven models are promising alternatives in providing timely predictions of FIOs for bathing water quality warning systems due to their lower computational requirements. Such models utilize data obtained by environmental sensors to predict FIO concentrations in bathing waters. The public could then be warned about occasions with high FIO concentrations. The development of such data-driven models requires identifying FIO predictive variables and establishing relationships between these variables and FIO concentrations. These two steps have been previously conducted mainly by stepwise multilinear regression (MLR) (e.g., Crowther et al. 2001; Nevers and Whitman 2005; Gonzalez et al. 2012; Wyer et al. 2013b; Gonzalez and Noble 2014). In a stepwise MLR, variables are included or excluded in a linear regression equation in a stepwise manner. Such inclusion or exclusion is decided by the influence of the input variables on estimating the target variables through linear regression analysis. This linear approach does not account for possible nonlinear relationships that could affect predicting extremes.

© ASCE 04022093-1 J. Environ. Eng.

J. Environ. Eng., 2023, 149(2): 04022093


A promising nonlinear approach is an artificial neural network (ANN), in which the predictive variables and FIO concentrations are linked by simplified yet nonlinear network-like models (Masters 1993; Garrett 1994; Russell and Norvig 2010). ANN has been applied to predict FIO concentrations in several studies (e.g., Jin and Englande 2006; He and He 2008; Zhang et al. 2012; Thoe et al. 2015; Zhang et al. 2018). ANN models have shown better performance in predicting extreme FIO concentration (both high and low) (Zhang et al. 2012) and give higher sensitivity to poor-water-quality events (Thoe et al. 2015) compared with stepwise MLR. However, ANN models cannot identify predictive variables from a data set; manual selection of predictive variables is highly dependent on the users' judgement. This could be challenging in data-rich sites, which will become more common as a result of enhancements in sensors and the implementation of digital environments. Although stepwise MLR can be conducted to identify predictive variables prior to ANN, stepwise MLR is not capable of identifying nonlinear relationships between variables and FIO concentrations.

A nonlinear alternative method to identify predictive variables is the gamma test. The gamma test determines the significance of a set of input variables in predicting the target data, e.g., FIO concentrations, by quantifying the residual variance that cannot be explained by any smooth nonlinear model (Jones 2004). Gamma tests do not require an assumed nonlinear function relating input variables and target data a priori. On the other hand, the gamma test does not determine the nonlinear model itself. To identify predictive variables in a data set, gamma tests can be applied to each possible combination of variables, and the combination that gives the smallest residual variance is chosen as the best combination. However, for data sets containing a large number of variables, searching the entire input combination space requires many gamma test computations and large computational power. A genetic algorithm (GA) model can be utilized to circumvent the need for such high computational power (Jones 2004). Although a cross-validation approach (Stone 1974) may be applied for nonlinear variable identification, the approach usually requires a priori regression equations or network architectures and can be computation intensive for large data sets because of the increased number of networks needed for cross-comparison.

This paper developed a data-driven model for FIO concentration prediction based on a limited number of critical input variables. Essential variables were identified with a combination of the gamma test and GA (Gamma-GA test). This approach was compared with the stepwise MLR, which has been commonly applied in water quality prediction (Crowther et al. 2001; Nevers and Whitman 2005; Wyer et al. 2013b; Thoe and Lee 2014). From the variables identified by the Gamma-GA test and stepwise MLR, ANN and linear regression models were developed and evaluated.

These techniques were applied to a data-rich test site, namely Swansea Bay, UK, where a significant amount of FIO and environmental data were collected. This is the first time a gamma test has been applied to identify FIO predictive variables at bathing water sites. Gamma testing has been applied by Kashefipour et al. (2005) and Lin et al. (2008) in bathing water quality modeling, but the test was not used for variable identification. The test has also been applied by Choubin and Malekian (2017) and Ghaderi et al. (2019) for variable identification, but their focus was not bathing water quality. The improvements in predicting FIO concentrations using the complex model proposed in this study at the case study site are highlighted.

Methodology

The first step in developing data-driven models is identifying the input variables. However, there may be dependencies among the available variables. To identify linearly independent variables among all available variables, a singular value decomposition–based collinearity analysis was conducted. The output variables from the analysis became the input variables for the Gamma-GA tests or stepwise MLRs. ANN models were developed from predictive variables identified with Gamma-GA tests (GG-ANN) and stepwise MLRs (SL-ANN), respectively. Linear regression models were also developed from the identified variables (GG-Linear and SL-Linear models). Fig. 1 shows a flowchart of the modeling approach in this paper, and the following sections give further details of the aforementioned tests and models.

Fig. 1. Flowchart for the modeling approach.

Collinearity Analysis

Linear correlation may exist among the variables within the measured data set. This is referred to as collinearity (Belsley et al. 1980). Correlation analysis was conducted in this paper to remove redundant variables. The correlation coefficients between variable pairs were computed. When the correlation coefficient was high (e.g., >0.6), one of the variables was removed; the variable to remove was selected by mechanistic-process-based judgement. Correlation analysis was also conducted to determine the lag time required for the time-lagged variables not to have a high correlation with the original unlagged variables.

Gamma Test and Genetic Algorithm

Gamma tests determine the part of the variance of target data that cannot be accounted for by any smooth nonlinear model. Nevertheless, a gamma test does not determine the model itself. The gamma test is briefly explained subsequently but further details
have been given by Stefansson et al. (1997), Evans and Jones (2002), and Jones (2004). Consider a data set of input variables (the independent environmental variables selected by the collinearity test in this paper) $X$ and target data (FIO concentrations in this paper) $y$ as follows:

$$X = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^N \\ x_2^1 & x_2^2 & \cdots & x_2^N \\ \vdots & \vdots & \ddots & \vdots \\ x_M^1 & x_M^2 & \cdots & x_M^N \end{bmatrix} \tag{1}$$

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} \tag{2}$$

where $x_i^j$ = jth input variable at time $i$ ($1 \le i \le M$); and $y_i$ = target data at time $i$. The row vectors of matrix $X$, that is

$$x_i = [x_i^1, x_i^2, \ldots, x_i^N]^T \tag{3}$$

where superscript $T$ = matrix transpose, are the environmental variable data points measured at time instant $i$ ($1 \le i \le M$). Assume that $x_i$ and $y_i$ are related as follows:

$$y_i = f(x_i) + r \tag{4}$$

where $f$ = nonlinear and smooth function; and $r$ = random variable (i.e., noise that is excluded from the input–target relationship). We define an imaginary data point $x_{i'}$ near $x_i$, and then

$$\gamma = \frac{1}{2M} \sum_{i=1}^{M} \left[ y(x_{i'}) - y(x_i) \right]^2 \tag{5}$$

Substituting Eq. (4) into Eq. (5) and considering the Taylor expansion $f(x_{i'}) = f(x_i) + (x_{i'} - x_i) \cdot \nabla f + O(|x_{i'} - x_i|^2)$, where $\nabla$ is the gradient operator, Eq. (5) becomes

$$\gamma = \frac{1}{2M} \sum_{i=1}^{M} \left[ (x_{i'} - x_i) \cdot \nabla f \right]^2 + \frac{1}{2M} \sum_{i=1}^{M} (r_2 - r_1)^2 \tag{6}$$

where $r_1$ and $r_2$ = two realizations of the random variable $r$ corresponding to $y(x_i)$ and $y(x_{i'})$, respectively. It follows from Eq. (6) that if $x_{i'} \to x_i$, then $\gamma \to \frac{1}{2M} \sum_{i=1}^{M} (r_2 - r_1)^2 = \mathrm{Var}(r)$, where $\mathrm{Var}(r)$ is the variance of $r$ in probability. $\mathrm{Var}(r)$ is obtained without knowing the expression of $f$.

The data point $x_{i'}$ that is arbitrarily close to $x_i$ does not exist in the measured data set $X$. An approach to estimate the first term on the right-hand side of Eq. (6) is required to obtain $\mathrm{Var}(r)$ from $\gamma$. In the gamma test, $x_{i'}$ is replaced by $x_{N[i,k]}$, the kth nearest data point to $x_i$. For example, $x_{N[i,k=1]}$ and $x_{N[i,k=2]}$ are the nearest and the second nearest points to $x_i$. With this replacement, Eq. (6) becomes

$$\gamma(k) = A(k)\,\delta(k) + \Gamma \tag{7}$$

where

$$\delta(k) = \frac{1}{M} \sum_{i=1}^{M} \left| x_{N[i,k]} - x_i \right|^2 \tag{8}$$

$$\gamma(k) = \frac{1}{2M} \sum_{i=1}^{M} \left( y_{N[i,k]} - y_i \right)^2 \tag{9}$$

$$A(k) = \frac{\sum_{i=1}^{M} \left[ (x_{N[i,k]} - x_i) \cdot \nabla f \right]^2}{2 \sum_{i=1}^{M} \left| x_{N[i,k]} - x_i \right|^2} \tag{10}$$

$$\Gamma = \frac{1}{2M} \sum_{i=1}^{M} (r_2 - r_1)^2 = \mathrm{Var}(r) \tag{11}$$

where $|\cdot|$ = Euclidean distance; and $y_{N[i,k]}$ = target value associated with $x_{N[i,k]}$ ($y_{N[i,k]}$ is not necessarily the kth nearest point to $y_i$ in the data set). Evans and Jones (2002) showed that $A(k) = A$ is a constant given $M$ is sufficiently large. Eq. (7) becomes

$$\gamma(k) = A\,\delta(k) + \Gamma \tag{12}$$

Both $A$ and $\Gamma$ in Eq. (12) can be obtained by conducting linear regression with $\gamma(k)$ and $\delta(k)$ computed from the kth nearest points to $x_i$ for $i = 1, \ldots, M$ and $k = 1, \ldots, p$, where $p$ is the maximum value of $k$ used. In this study, $p = 10$ as suggested by Jones (2004). Then, $x_{N[i,k]}$ for each $x_i$ in $\gamma(k)$ and $\delta(k)$ are obtained by an efficient k-dimensional tree approach (computational time in the order of $M \log M$) from Bentley (1975).

Gamma tests need to be applied to all possible combinations of input variables in order to identify the strongest predictive variables from all available data. This is the combination that gives the lowest absolute value of $\Gamma$ (i.e., the combination of input variables that gives the smallest noise variance). However, this approach is computationally demanding for large data sets; the number of possible combinations for a data set of $m$ variables is $2^m - 1$. To circumvent the need for high computational power, the variable selection problem is expressed as a minimization problem, which is solved using GA (Jones 2004). The combination of input variables that minimizes $|\Gamma|$ is selected as the solution. Combinations of input variables in the GA model are represented by a binary vector of length $N$ (a mask) in which the inclusion or exclusion of a variable is indicated by 1 or 0, respectively. In this work, the GA function ga in MATLAB version R2019b Global Optimization Toolbox (Mathworks 2020b) was used. Readers are also referred to Deb (2000) and Deep et al. (2009) for details of the GA approach.

The M-test (Jones 2004) can be used to determine the minimum required length of the input and target data set, the value of $M$ in Eq. (1). This value of $M$ is also the minimum data length required for model training if a nonlinear model is applied to the data set. In an M-test, gamma tests are conducted sequentially with progressively increasing $M$. The computed $|\Gamma|$ is plotted against data length. The minimum required $M$ is the value of $M$ beyond which $|\Gamma|$ becomes constant. M-test results are expected to be different when the ordering of the data is different. In this work, the data order was randomly generated and three different realizations (Realizations 1, 2, and 3) of data in different order were tested to reduce reliance on the order of the data.

Artificial Neural Network

Feedforward back-propagation ANN models were used to predict fecal-indicator organism concentrations, including Enterococci and E. coli, based on the predictive variables selected by the Gamma-GA tests. Each network consists of an input layer, an output layer, and one hidden layer. One hidden layer suffices in this case because Masters (1993) showed that networks with one hidden layer are generally capable of approximating most underlying functions.



The authors also tested GG-ANN and SL-ANN networks with two hidden layers, and no significant improvement in performance was obtained compared with one-hidden-layer networks. The number of nodes required in the hidden layer was determined by experimentation to give the best results without overfitting.

The networks were trained and validated with the ANN function "train" in MATLAB version R2019b Deep Learning Toolbox (Mathworks 2020a). To retain a portion of the data for model validation and to avoid overtraining, the data set was divided into three sets, namely the training, validation, and testing sets. For each of the three realizations in the M-test, the data were grouped into the three sets according to the data order (i.e., the first to nth data were put into the training set; the (n + 1)th to mth data were put into the validation set; the (m + 1)th to the end of the data were put into the testing set; n < m). The performance function used in this study was the mean squared error (MSE) between model outputs and target data. Training of a network was stopped when no further improvement in MSE for the validation data set could be achieved after six iterations. Although this method avoids parameter (weights and bias) overtraining, it does not avoid overtraining due to overcomplicated network architecture and redundant predictive variables. This fact highlights the importance of the Gamma-GA test in rejecting redundant variables. For each network, 300 training runs with random initial weights were conducted, and the network that provided the minimum MSE was chosen. Iyer and Rhinehart (1999) showed that the network obtained from this approach has a 95% confidence level that its MSE is within the lowest 1.0%.

Linear regression models with their predictive variables selected by the Gamma-GA tests (GG-Linear models) and stepwise MLR models (SL-Linear models) were also developed to assess ANN model performance.

Model Application

The model was applied to Swansea Bay, located on the north of the Bristol Channel in the southwest of the UK, as shown in Fig. 2(a). Along the bay are two sandy beaches with bathing water status: the Swansea Beach and Aberavon Beach. Potential sources of FIO in the Bay are the discharges from rivers, streams, surface water drains, three offshore outfalls from wastewater treatment works, and transport by currents from sources outside of the Bay. Large amounts of data were collected as a part of the previous Smart Coast Sustainable Communities (SCSC) research project (Wyer et al. 2013b, 2018), which, alongside the variety of the sources, make the Bay an ideal case study for data-driven modeling. The stream and drain discharges are generally low (<1 m³/s); Rivers Tawe, Clyne, Neath, and Afan have relatively high flow rates (>5 m³/s). The water is well-mixed in the Severn Estuary and Bristol Channel (Uncles 1981; Evans et al. 1990; Ahmadian et al. 2013). FIO concentrations in the beaches are governed by the sources and the hydrodynamics in the Bay (Ahmadian et al. 2013).

The concentrations of two FIO species, namely E. coli and Enterococci, were sampled at various sources and receptors at high frequency, i.e., intervals of 15–30 min, in years 2011 and 2012. In Swansea Beach, the large tidal range (exceeding 10 m) and sloping beach result in a tidal flat exposed up to 1,500 m from shore during high spring tides. The large extent of the tidal flats makes single-point FIO concentration measurement impossible. In the data collection scheme, FIO concentrations were measured along a sampling transect consisting of designated sampling points (DSPs) in Swansea Bay, as shown in Fig. 2(a). Fig. 2(b) shows the DSPs in the sampling transect in the 2011 bathing season. The samples were collected in sterile 1-L containers (Aurora Scientific, Bristol, UK) and stored in a refrigerator before analysis. The samples were then analyzed for intestinal Enterococci and E. coli with standard membrane filtration techniques and analyzed for turbidity with a LP2000 bench turbidity meter (Hannah Instruments, Leighton Buzzard, UK). Salinity was also measured with a conductivity meter (Model SevenGo, Mettler Toledo, Leicester, UK).

Fig. 2(a) also shows the sampling locations of the environmental variables in the SCSC project. Tide level and the flow rates at Rivers Tawe, Neath, and Afan were measured by the existing gauges in the hydrometric monitoring network operated by Natural Resources Wales (NRW), the official natural resource management organization in Wales. The water depths and velocities at the five smaller streams were measured by OTT Orpheus Mini pressure transducers (OTT HydroMet, Kempten, Germany) and Sensa RC2 electromagnetic velocity meters (Aqua Data, Oxfordshire, UK), respectively. Global radiation, temperature, relative humidity, rainfall, and wind speed were measured at the meteorological station.

Global radiation was measured by a SKS 1110 pyranometer (Skye Instruments, Llandrindod Wells, UK); air temperature and relative humidity were measured by a HygroClip2 Hc2-S3 sensor (Rotronic, Crawley, UK); rainfall was measured with a 370C tipping-bucket rain gauge (20.3 cm aperture and 0.2 mm tip; Met One Instruments, Grants Pass, Oregon); wind speed and direction were measured with a WindSonic anemometer (Gill Instruments, Hampshire, UK). Offshore wastewater discharge volumes were also measured in the SCSC project but were not included as potential model inputs because the tracer study conducted as a part of the SCSC project (Ahmadian et al. 2013) and the two-dimensional TELEMAC hydrodynamic simulation conducted by the authors suggested that they are not important for FIO concentrations at the DSPs compared with other FIO sources (not shown).

Tables 1 and 2 summarize the data set used in this paper. The target variables were Enterococci and E. coli concentrations measured at the Bathing Water DSPs during a bathing season, namely June 22 to September 28, 2011. The input data set included 16 environmental variables measured in the same bathing season as indicated in Table 2. The ranges of values of different input and target variables were significantly different, as detailed in Tables 1 and 2, due to the large number of factors that affect bacteria concentrations. In order to ensure consistency between data and reduce the impact of variation ranges on the model, all the data have been normalized to the range of 0–1 using the following equation:

$$x_{j,\mathrm{nor}} = \frac{x_j - x_{j,\min}}{x_{j,\max} - x_{j,\min}} \tag{13}$$

where $x_j$ and $x_{j,\mathrm{nor}}$ = unnormalized and normalized jth variable; and $x_{j,\max}$ and $x_{j,\min}$ = maximum and minimum of the time series $x_j$ before data processing and model training.

Logarithmic transformation was applied to the variables that have relatively high skewness to transform them from a lognormal-like distribution to a normal-like distribution. This transformation is necessary because stepwise MLR models assume normally distributed data (skewness = 0). If the data have a significant skewness, the stepwise MLR variable inclusion/exclusion procedures may not be suitable, and subsequently the model would not result in good validation. Collinearities among variables were identified, and redundant variables were removed. In order to build memory of the past conditions, e.g., solar radiation or rain prior to the simulation, and time required for transport of bacteria across the bay, which could significantly affect the concentration of bacteria, time-lagged variables were also considered as the input variables in the data-driven models. Correlation analysis was conducted to identify collinear variables and determine a lag time that did not cause additional collinearity issues.

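The training protocol of the Artificial Neural Network section (ordered split into training, validation, and testing sets; stopping after six iterations without a better validation MSE; repeated runs from random initial weights) can also be sketched in a few lines. The sketch below is an assumption-laden stand-in, not the MATLAB "train" workflow used in the study: it uses plain per-sample gradient descent, selects the best run by validation MSE (the text does not state which data set the minimum MSE refers to), and uses 5 restarts instead of 300 to keep the example fast. The network size, learning rate, function names, and data are hypothetical.

```python
import copy
import math
import random


def split_by_order(samples, n, m):
    """First n samples train, samples n+1..m validate, the rest test (n < m)."""
    return samples[:n], samples[n:m], samples[m:]


def forward(net, x):
    """One hidden tanh layer followed by a linear output node."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    return sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"], hidden


def mse(net, samples):
    return sum((forward(net, x)[0] - t) ** 2 for x, t in samples) / len(samples)


def train_once(train, val, n_hidden, rng, lr=0.05, patience=6, max_epochs=500):
    """Train until `patience` epochs pass without a better validation MSE.

    Returns the best network seen and its validation MSE.
    """
    n_in = len(train[0][0])
    net = {"w1": [[rng.uniform(-1, 1) for _ in range(n_in)]
                  for _ in range(n_hidden)],
           "b1": [rng.uniform(-1, 1) for _ in range(n_hidden)],
           "w2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
           "b2": 0.0}
    best_net, best_val, fails = copy.deepcopy(net), mse(net, val), 0
    for _ in range(max_epochs):
        for x, t in train:
            out, hidden = forward(net, x)
            err = out - t
            for h in range(n_hidden):
                # Gradient w.r.t. the hidden pre-activation (uses the old w2).
                grad_h = err * net["w2"][h] * (1.0 - hidden[h] ** 2)
                net["w2"][h] -= lr * err * hidden[h]
                net["b1"][h] -= lr * grad_h
                for i in range(n_in):
                    net["w1"][h][i] -= lr * grad_h * x[i]
            net["b2"] -= lr * err
        val_mse = mse(net, val)
        if val_mse < best_val:
            best_net, best_val, fails = copy.deepcopy(net), val_mse, 0
        else:
            fails += 1
            if fails >= patience:
                break
    return best_net, best_val


def train_with_restarts(train, val, n_hidden=3, n_restarts=5, seed=0):
    """Repeat training from random initial weights and keep the best run."""
    rng = random.Random(seed)
    runs = [train_once(train, val, n_hidden, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])


# Made-up, already normalized data: one input, target = input squared.
data_rng = random.Random(1)
samples = [([x], x * x) for x in (data_rng.random() for _ in range(60))]
train, val, test = split_by_order(samples, n=36, m=48)
best_net, best_val = train_with_restarts(train, val)
print("validation MSE:", best_val, "test MSE:", mse(best_net, test))
```

Keeping a copy of the best network, rather than the final one, means the weights returned are those at the point where validation performance stopped improving.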


Fig. 2. (a) Site layout and key sampling locations in Swansea Bay, UK (sources: Esri, DigitalGlobe, GeoEye, i-cubed, USDA FSA, USGS, AEX,
Getmapping, Aerogrid, IGN, IGP, swisstopo, and the GIS User Community); and (b) close-up of DSPs (circles) in the sampling transect in plot (a) in
the 2011 bathing season at 30-min intervals from 07:00 to 16:00 (image © Google, Image © 2022 TerraMetrics).

Table 1. Measured FIO concentrations during June 22–September 28, 2011

                                                  Range after transformation
FIO variables               Ln transformation     Minimum     Maximum
E. coli (cfu/100 mL)        Yes                   1.10        8.04
Enterococci (cfu/100 mL)    Yes                   1.10        8.37

Note: Ln = natural logarithm.

Results and Discussion

Input Variable Selection

From the correlation analysis, 23 input variables were identified as shown in the Variables Identified from the Correlation Analysis column in Table 3. Only one single representative streamflow, the flow of the Tawe River, was selected because flows at different streams with no time lag were found highly correlated. Such a high



Table 2. Measured environmental variables during June 22–September 28, 2011

                                                                               Range after transformation
Variable type          Variable                            Ln transformation   Minimum     Maximum
Streamflow data        Washinghouse Brook (m³/s)           Yes                 −4.71       −0.03
                       Brockhole stream (m³/s)             Yes                 −5.30       −1.94
                       Clyne River (m³/s)                  Yes                 −2.42       1.98
                       Brynmill stream (m³/s)              Yes                 −4.20       1.33
                       River Tawe (m³/s)                   Yes                 0.843       5.15
                       River Neath (m³/s)                  Yes                 0.642       5.03
                       River Afan (m³/s)                   Yes                 0.298       4.23
Tidal data             Normalized tide level at Mumbles    No                  −0.499      0.483
Meteorological data    Global radiation (W/m²)             Yes                 −1.97       6.86
                       Temperature (°C)                    No                  8.92        23.2
                       Relative humidity (%)               No                  34.3        99.0
                       Rainfall (mm)                       Yes                 −13.8       0.588
                       Wind speed to the north (m/s)       No                  −11.6       4.14
                       Wind speed to the east (m/s)        No                  −5.90       6.64
Water quality data     Turbidity (NTU)                     Yes                 0.843       4.97
                       Salinity (ppt)                      No                  1.90        153

Note: Ln = natural logarithm; NTU = nephelometric turbidity unit; and ppt = part per thousand.

correlation could be explained by the small size of catchment associated with each stream, which means all streams are influenced by similar weather, and particularly rainfall, patterns. Burton et al. (2013) reported that the spatial correlation of rainfall at the site remains higher than 0.5 for two points that are 100 km apart, implying that the rainfall is correlated within a 100 × 100 = 10,000-km² area. This area is larger than the sum of the watersheds of three major rivers discharging to the Bay (506.4 km²), namely the watershed for River Tawe (227.7 km²), River Neath (190.9 km²), and River Afan (87.8 km²). Turbidity and salinity were also found highly correlated with the streamflow and thus eliminated from the data set. This is consistent with the idea of Thoe and Lee (2014) that salinity reflects the mixing between riverine freshwater and the ambient seawater. As expected, the correlation analysis also showed high correlations between a time-lagged variable and the same variable with no time lag if the time lag was not sufficiently long. The correlation between the no-time-lag and time-lagged streamflow remained high (>0.6) for a lag time of 0.25–36 h; only the streamflow with 10-h lag was selected following hydrodynamic model results (Lam and Ahmadian 2022).
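The two uses of correlation analysis in this study (flagging highly correlated variable pairs and finding a lag beyond which a lagged variable is no longer strongly correlated with its unlagged version) can be sketched as follows. The sketch is illustrative only: the 0.6 threshold is taken from the text, but the function names, the short made-up series, and the hourly sine wave are assumptions, and the choice of which variable of a flagged pair to remove is left to the analyst, as in the study.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlated_pairs(variables, threshold=0.6):
    """Return the name pairs whose |r| exceeds the threshold."""
    names = list(variables)
    return [(p, q)
            for i, p in enumerate(names)
            for q in names[i + 1:]
            if abs(pearson(variables[p], variables[q])) > threshold]


def shortest_uncorrelated_lag(series, max_lag, threshold=0.6):
    """Smallest lag (in samples) with |r| <= threshold against the unlagged series."""
    for lag in range(1, max_lag + 1):
        if abs(pearson(series[lag:], series[:-lag])) <= threshold:
            return lag
    return None


# Made-up values: two flows that rise together and an alternating signal.
variables = {
    "tawe": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "neath": [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1],
    "tide": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
pairs = correlated_pairs(variables)
print(pairs)

# Made-up hourly signal with a 24-sample period.
signal = [math.sin(2 * math.pi * t / 24) for t in range(240)]
lag = shortest_uncorrelated_lag(signal, max_lag=12)
print(lag)
```

For the sine wave, the correlation at a lag of $L$ samples is approximately $\cos(2\pi L/24)$, so the first lag at or below the 0.6 threshold is $L = 4$, where the correlation is about 0.5.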

Table 3. Predictive variables selected by the Gamma-GA tests and stepwise MLRs

                                  Enterococci                     E. coli
Variables identified from         Gamma-GA    Stepwise            Gamma-GA    Stepwise
the correlation analysis          test        linear model        test        linear model
Streamflow (lag 10 h)             1           0                   0           1
Mumbles level (lag 2 h)           0           0                   0           0
Mumbles level (lag 4 h)           1           0                   1           1
Mumbles level (lag 6 h)           1           1                   1           0
Global radiation (lag 2 h)        0           1                   0           1
Global radiation (lag 4 h)        0           0                   1           0
Global radiation (lag 6 h)        1           0                   0           0
Temperature (lag 2 h)             1           0                   0           1
Temperature (lag 6 h)             1           0                   1           0
Relative humidity (lag 2 h)       0           1                   0           1
Relative humidity (lag 8 h)       1           0                   1           0
Cumulative of rain (lag 2 h)      0           0                   0           0
Cumulative of rain (lag 3 h)      0           0                   0           0
Cumulative of rain (lag 4 h)      0           0                   0           0
Cumulative of rain (lag 6 h)      0           0                   0           0
Cumulative of rain (lag 8 h)      0           0                   0           0
Cumulative of rain (lag 10 h)     0           0                   0           0
Cumulative of rain (lag 12 h)     0           1                   0           0
Wind speed N (lag 2 h)            0           1                   1           1
Wind speed N (lag 6 h)            0           0                   1           0
Wind speed N (lag 10 h)           0           1                   0           1
Wind speed E (lag 2 h)            1           1                   1           1
Wind speed E (lag 10 h)           0           1                   0           0

Note: 1 = selected; and 0 = not selected.

Fig. 3. M-test results for (a) Enterococci, Realization 3; and (b) E. coli, Realization 1.



For other variables, the minimum time lag from the FIO data was 2 h to render the AI model predictive. Additional time lags were applied to these variables at suitable time intervals such that the correlations between time-lagged and unlagged variables were not significant (<0.6). The time intervals determined by correlation analysis were as follows: 2 h for tides, 2 h for radiation, 6 h for humidity, 0.25 h for rainfall, 4 h for temperature, 4 h for northern wind speed (wind speed N), and 8 h for eastern wind speed (wind speed E). Table 3 indicates that the time interval used for rainfall is greater than 0.25 h; this is because rainfall was not expected to have an immediate effect on FIO concentrations from a physical process point of view.

Table 3 provides the predictive variables identified by the Gamma-GA tests and stepwise MLRs. For a consistent comparison between the methods and to prevent overparameterization, both Gamma-GA tests and stepwise MLRs were constrained to choose a maximum of eight variables. Ideally, an interpretability analysis of the variables identified by Gamma-GA tests and stepwise MLRs would be used to assess the performance of the Gamma-GA tests, but such a comparison requires a priori knowledge about the relative importance of the variables at the site, which is not available to date. Nevertheless, the physical plausibility of the selected variables is discussed as follows.

Tide level was selected as an important variable by both Gamma-GA tests and stepwise MLR models. The results are consistent with the fact that tides were shown to be important to the flow in Swansea Bay (Ahmadian et al. 2013) as well as to FIO concentrations (Lam and Ahmadian 2022). Tides were also identified as an important variable in other data-driven models for other nearshore coastal waters (Crowther et al. 2001; Nevers and Whitman 2005; He and He 2008; Zhang et al. 2012). Wind was also shown to be important for FIO concentration by both predictive variable identification methods, which is consistent with the stepwise MLR results of Wyer et al. (2013b). Streamflow, a known FIO source (e.g., Wyer et al. 2010, 2013a; Lam and Ahmadian 2022), was included only by the Gamma-GA test for Enterococci and the stepwise MLR for E. coli, and not by the other tests. This is attributed to the small spatial and temporal scale of the site (a watershed of about 500 km² and a sampling interval of 30 min). In this study, flow rates of different rivers under a time lag of less than 36 h were highly correlated, and one representative streamflow (River Tawe) at one particular time lag (10 h) was selected. Information concerning the exact riverine FIO sources for the measured FIO concentration was lost. In summary, the Gamma-GA test could identify predictive variables that are consistent with the literature.

Fig. 4. MSEs for the (a) training; (b) validation; and (c) testing sets of GG-ANN and SL-ANN models versus number of hidden layer nodes for Enterococci, Realization 3.

Fig. 5. MSEs for the (a) training; (b) validation; and (c) testing sets of GG-ANN and SL-ANN models versus number of hidden layer nodes for E. coli, Realization 1.

M-Test

M-testing was conducted for the variables identified by the Gamma-GA tests and stepwise MLRs to determine the data length needed for model training. Fig. 3 shows that the Gamma-GA tests selected variables that achieve lower (i.e., better) |Γ| compared with the stepwise MLRs given sufficiently long data (e.g., beyond 500 data points), which means that the data length for model training should be greater than 200 for the Gamma-GA test to give better results compared with stepwise MLR. Following the M-test, the ratio of data points in the training, validation, and testing sets was 0.6:0.2:0.2, giving 949 × 0.6 ≈ 571 data points in the training set, which satisfies the minimum of 200 data points imposed by the M-test. The mean and standard deviation values of the training, validation, and testing data sets were checked to be approximately comparable in all three realizations.

ANN Model Results

Selection of Number of Nodes

Fig. 4 shows a typical relationship between MSE and number of hidden layer nodes for Enterococci. SL-ANN models reached lower MSEs when there were few nodes in the hidden layer. When there were more hidden layer nodes (>10), GG-ANN models achieved better performance.

For E. coli, Fig. 5 shows that GG-ANN and SL-ANN models gave similar MSEs. This does not conflict with the M-test results. Although the M-test results suggested that the Gamma-GA tests identified variables that had the potential to achieve lower MSEs, the M-test does not specify the nonlinear model that gives such results. It is possible that nonlinear models other than ANN give better results with the Gamma-GA identified variables; however, this is outside the scope of this study. For further comparison between models developed from the Gamma-GA tests and stepwise MLRs, networks with 1 to 50 nodes in the hidden layer were tested, and the number of nodes in the hidden layer was selected as the one providing the lowest validation MSE.
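The data split and the node-selection rule described in this part of the paper can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation. It assumes that the validation and testing set sizes are rounded down and that the training set takes the remainder, which reproduces the 571 training points quoted in the text and the 189 testing points that each panel of Fig. 8 sums to. The function names and the example validation MSE values are hypothetical.

```python
def split_sizes(n_total, val_fraction=0.2, test_fraction=0.2):
    """Return (train, validation, test) sizes.

    Assumption: validation and test sizes are rounded down and the
    training set takes whatever remains.
    """
    n_val = int(n_total * val_fraction)
    n_test = int(n_total * test_fraction)
    n_train = n_total - n_val - n_test
    return n_train, n_val, n_test


def select_hidden_nodes(validation_mse_by_nodes):
    """Pick the hidden-layer size with the lowest validation MSE."""
    return min(validation_mse_by_nodes, key=validation_mse_by_nodes.get)


sizes = split_sizes(949)

# Hypothetical validation MSEs for a few candidate node counts (not study results).
example_mse = {5: 0.031, 10: 0.024, 20: 0.019, 35: 0.016, 50: 0.021}
best_nodes = select_hidden_nodes(example_mse)

# The M-test requirement quoted in the text: more than 200 training points.
meets_m_test_minimum = sizes[0] > 200
```

Selecting on the validation MSE rather than the training MSE keeps the testing set untouched for the final comparison between models.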

Fig. 6. Regressions between target Enterococci concentrations and GG-ANN model outputs for (a) training; (b) validation; (c) testing; and (d) all data
sets, Realization 3. LCC = linear correlation coefficient.


Fig. 7. Regressions between target E. coli concentrations and GG-ANN model outputs for (a) training; (b) validation; (c) testing; and (d) all data sets,
Realization 1.

Mean Squared Error and R²

Figs. 6 and 7 show the comparison between GG-ANN model results and target FIO concentrations for the training, validation, and testing data sets, as well as all data. Tables 4 and 5 present the comparison among GG-ANN, SL-ANN, GG-Linear, and SL-Linear models. For most ANN models, the optimal SL-ANN models consisted of fewer hidden layer nodes compared with GG-ANN models, which is consistent with the “Selection of Number of Nodes” section.

GG-ANN and SL-ANN models gave better MSE and R² than GG-Linear and SL-Linear models. This shows the capacity of nonlinear models to capture inherent nonlinear relationships between variables and FIO concentrations. GG-ANN models gave better training, validation, and testing results than SL-ANN models for Enterococci, but SL-ANN models gave better validation and testing results for E. coli. The better GG-ANN performance for Enterococci can be explained by the fact that GG-ANN better captures extreme FIO concentrations, as illustrated in the “Performance Tables” section. This GG-ANN property helps the models perform better for Enterococci because there are more extreme values in the data series of Enterococci (17.8% of the data was below 0.1 or above 0.9) compared with E. coli (8.9% of the data was below 0.1 or above 0.9). The MSE of SL-Linear models was better than that of the GG-Linear models, confirming that stepwise MLR, in contrast to the Gamma-GA test, chooses variables that optimize linear model performance.

Performance Tables

The ability to identify the most hazardous circumstances, namely poor water quality conditions, is particularly important when a real-time predictive model is used as an early warning system. The EU rBWD (European Commission 2006) considers the water quality in a bathing site to be poor if the 90-percentile FIO concentration in the lognormal distribution obtained from the last assessment period (usually the last four bathing seasons) exceeds a given threshold. The threshold is 185 colony-forming units per 100 mL (cfu/100 mL) for Enterococci and 500 cfu/100 mL for E. coli. In this study, the use of 90-percentile values was not sensible because water quality is being predicted at a 30-min interval. To test the models' ability to identify poor water quality events, individual Enterococci and E. coli concentration values were compared with the 185 and 500 cfu/100 mL thresholds, respectively. Fig. 8 shows the performance tables of the data-driven models in correctly predicting poor water quality under the EU rBWD classification for the testing sets. In this context, sensitivity and specificity are defined as follows:

$$\text{Sensitivity} = \frac{\text{Correctly predicted poor water quality}}{\text{Observed poor water quality}} \tag{14}$$



Table 4. MSE and unadjusted R² between computed and measured Enterococci concentrations
Columns: Model; hidden layer node number; MSE (training, validation, testing); R² (training, validation, testing)
Realization 1
GG-ANN 35 0.0074 0.0157 0.0214 0.8369 0.6654 0.5361
SL-ANN 11 0.0210 0.0232 0.0260 0.5400 0.5079 0.4357
GG-linear N/A 0.0357 0.0399 0.2224 0.1348
SL-linear N/A 0.0311 0.0328 0.3235 0.2883
Realization 2
GG-ANN 38 0.0134 0.0172 0.0227 0.7177 0.6025 0.4993
SL-ANN 6 0.0257 0.0194 0.0295 0.4542 0.5518 0.3611
GG-linear N/A 0.0368 0.0352 0.2021 0.2246
SL-linear N/A 0.0312 0.0322 0.3229 0.2895
Realization 3
GG-ANN 40 0.0071 0.0188 0.0199 0.8292 0.6457 0.6156
SL-ANN 12 0.0192 0.0243 0.0225 0.5385 0.5418 0.5699
GG-linear N/A 0.0359 0.0393 0.1944 0.2403
SL-linear N/A 0.0320 0.0293 0.2812 0.4337
Note: Bold indicates the best results for each realization.

Table 5. MSE and unadjusted R² between computed and measured E. coli concentrations
Columns: Model; hidden layer node number; MSE (training, validation, testing); R² (training, validation, testing)
Realization 1
GG-ANN 25 0.0067 0.0159 0.0214 0.8396 0.6077 0.5312
SL-ANN 18 0.0096 0.0154 0.0174 0.7705 0.6198 0.6177
GG-linear N/A 0.0325 0.0342 0.2166 0.2482
SL-linear N/A 0.0297 0.0305 0.2838 0.3290
Realization 2
GG-ANN 13 0.0119 0.0178 0.0205 0.7208 0.5939 0.4801
SL-ANN 18 0.0113 0.0135 0.0188 0.7370 0.6914 0.5221
GG-linear N/A 0.0331 0.0315 0.2294 0.1996
SL-linear N/A 0.0299 0.0295 0.3041 0.2484
Realization 3
GG-ANN 25 0.0073 0.0146 0.0196 0.8066 0.6747 0.6337
SL-ANN 27 0.0091 0.0155 0.0186 0.7582 0.6539 0.6501
GG-linear N/A 0.0321 0.0363 0.1873 0.3181
SL-linear N/A 0.0297 0.0310 0.2501 0.4167
Note: Bold indicates the best results for each realization.

$$\text{Specificity} = \frac{\text{Correctly predicted not poor water quality}}{\text{Observed not poor water quality}} \tag{15}$$

To explain, sensitivity represents the likelihood that a poor-water-quality event is correctly predicted. Specificity represents the likelihood that a not-poor-water-quality event is correctly predicted. It can alternatively be interpreted as one minus the false alarm rate.

Consistent with the result that the ANN models gave better MSE and R² values, the ANN models gave significantly higher sensitivity than the linear regression models: 24%–62% and 0%–14%, respectively. The observation that nonlinear models give more accurate FIO predictions is consistent with Thoe et al. (2015). The results are also consistent with those of Zhang et al. (2012) that ANNs capture extreme FIO values better than linear regression models. Sensitivities of GG-ANN models were higher than those of SL-ANN models for both Enterococci and E. coli, despite the MSEs for SL-ANN models being lower than those for GG-ANN models for E. coli. This suggests that GG-ANN models better capture high FIO concentrations compared with SL-ANN models. The sensitivity of SL-Linear models was better than that of GG-Linear models, as expected from the MSE and R² results. Specificities for all the models tested were always higher than 90%. This is probably due to the small proportion of poor-water-quality events during the study period.

Sensitivity was higher for Enterococci concentrations than for E. coli in Fig. 8 for GG-ANN models. The same was observed in the performance tables for the entire data set (949 data points). An explanation can be given from the probability distribution of the Enterococci and E. coli data. A chi-square test showed that both data sets follow lognormal distributions with a confidence level above 95% if very small values (<3 cfu/100 mL) were removed. From the respective probability distribution functions, the exceedance probability of the Enterococci threshold (15.9%) was higher than that of E. coli (5.5%) for the entire data set. With a lower exceedance probability, models that better capture extreme values are required to achieve a better sensitivity for E. coli compared with Enterococci concentrations.
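The exceedance probabilities quoted above follow from the fitted lognormal distributions. As a sketch, assume that the natural logarithm of the concentration C is normally distributed with mean μ and standard deviation σ. These symbols are introduced here for illustration only, and the fitted values are not reported in this section. The probability of exceeding a threshold T is then:

```latex
P(C > T) = 1 - \Phi\!\left(\frac{\ln T - \mu}{\sigma}\right)
```

Here Φ is the standard normal cumulative distribution function, and T is 185 cfu/100 mL for Enterococci or 500 cfu/100 mL for E. coli.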


(a)                        Observed
                           Not poor   Poor
Predicted   Not poor       146        11       93%
            Poor           14         18       56%
                           91%        62%      87%

(b)                        Observed
                           Not poor   Poor
Predicted   Not poor       170        6        97%
            Poor           8          5        38%
                           96%        45%      93%

(c)                        Observed
                           Not poor   Poor
Predicted   Not poor       155        22       88%
            Poor           5          7        58%
                           97%        24%      86%

(d)                        Observed
                           Not poor   Poor
Predicted   Not poor       172        8        96%
            Poor           6          3        33%
                           97%        27%      93%

(e)                        Observed
                           Not poor   Poor
Predicted   Not poor       160        29       85%
            Poor           0          0        N/A
                           100%       0%       85%

(f)                        Observed
                           Not poor   Poor
Predicted   Not poor       178        11       94%
            Poor           0          0        N/A
                           100%       0%       94%

(g)                        Observed
                           Not poor   Poor
Predicted   Not poor       157        25       86%
            Poor           3          4        57%
                           98%        14%      85%

(h)                        Observed
                           Not poor   Poor
Predicted   Not poor       178        10       95%
            Poor           0          1        100%
                           100%       9%       95%

Fig. 8. Performance tables for data-driven models: (a and b) GG-ANN models; (c and d) SL-ANN models; (e and f) GG-Linear models; and
(g and h) SL-Linear models. Plots (a, c, e, and g) show Enterococci Realization 1 testing sets, and plots (b, d, f, and h) show E. coli Realization
3 testing sets.
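As a worked example of Eqs. (14) and (15), the sketch below computes sensitivity, specificity, and overall accuracy from the four counts of a performance table, using the values of Fig. 8(a) (GG-ANN, Enterococci). The function and argument names are hypothetical, chosen here for illustration.

```python
def performance_metrics(pred_not_poor_obs_not_poor, pred_not_poor_obs_poor,
                        pred_poor_obs_not_poor, pred_poor_obs_poor):
    """Sensitivity, specificity, and overall accuracy from a 2x2 performance table.

    Rows of the table are predictions and columns are observations.
    """
    observed_poor = pred_not_poor_obs_poor + pred_poor_obs_poor
    observed_not_poor = pred_not_poor_obs_not_poor + pred_poor_obs_not_poor
    total = observed_poor + observed_not_poor

    # Eq. (14): correctly predicted poor / observed poor
    sensitivity = pred_poor_obs_poor / observed_poor
    # Eq. (15): correctly predicted not poor / observed not poor
    specificity = pred_not_poor_obs_not_poor / observed_not_poor
    accuracy = (pred_poor_obs_poor + pred_not_poor_obs_not_poor) / total
    return sensitivity, specificity, accuracy


# Counts taken from Fig. 8(a).
sensitivity, specificity, accuracy = performance_metrics(146, 11, 14, 18)
false_alarm_rate = 1.0 - specificity
```

With these counts the sensitivity is 18/29, the specificity is 146/160, and the overall accuracy is 164/189, which round to the 62%, 91%, and 87% shown in the panel.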

Discussion

Gamma testing is a promising tool to identify input data for a data-driven model because it is nonlinear and does not require a regression equation a priori. Nevertheless, these advantages do not imply that gamma tests can be applied with no knowledge about the waterbody or the data to which the test is applied. In this paper, a correlation analysis was conducted prior to the Gamma-GA tests to remove highly correlated candidate predictive variables.

The GG-ANN and SL-ANN models fitted the measured FIO concentrations better and captured the extreme FIO concentrations better than the GG-Linear and SL-Linear models. This is consistent with Zhang et al. (2012) for FIO concentrations and Keiner and Yan (1998) for chlorophyll-a and suspended sediments. Whereas Zhang et al. (2012) arrived at this conclusion by comparing 15-variable ANN models to five- or six-variable linear regression models, this study confirms their findings with the same number of explanatory variables used in the ANN and linear models. This demonstrates the importance of including nonlinearity in capturing high FIO concentrations. The effect of the nonlinearity of ANN was also reflected in the higher sensitivities of GG-ANN and SL-ANN models compared with GG-Linear and SL-Linear models.

Comparing GG-ANN and SL-ANN models, GG-ANN models gave better results for Enterococci for all training, validation, and testing sets and for most of the training sets for E. coli. Although SL-ANN models gave better testing results for E. coli, GG-ANN models gave higher sensitivities for both Enterococci and E. coli, showing that Gamma-GA tests select variables that capture extreme FIO concentrations better than the stepwise MLR. The results suggest that GG-ANN models are more suitable for bathing water quality warning applications in which predicting high FIO concentrations is the major concern.



This paper presented a GG-ANN model training and validation framework that is generally applicable to different sites, although a new GG-ANN model must be developed for every new study. Once the GG-ANN model is developed, it can discern critical parameters from redundant parameters for water quality prediction and keep the sampling cost of running the model in real time limited by measuring only critical parameters. The data used in this paper had a very short (0.5 h) sampling interval, whereas the commonly used sampling intervals are usually in the order of days (He and He 2008; Zhang et al. 2012; Thoe et al. 2015; Zhang et al. 2018). The models used in this paper generally gave lower R² than the daily predictions of water quality in the literature. This highlights the difficulty of short-term water quality prediction; further study of the effect of time scale on prediction accuracy is needed. Although Zhang et al. (2018) attempted to predict water quality with ANN at different time scales, their results were not conclusive. Nevertheless, this paper highlights the potential for a combination of the Gamma-GA test and a nonlinear predictive model to give timely bathing water quality prediction. This can be used as an early warning system for impending poor water quality together with real-time environmental sensors.

Conclusion

This paper developed a data-driven model for FIO concentration prediction with only a limited number of critical input variables and without unnecessary ones. The performance of the Gamma-GA test as a tool for predictive variable identification of Enterococci and E. coli at an interval of 30 min was evaluated. ANN and linear regression models were developed from the variables identified by Gamma-GA tests and stepwise MLR for comparison. The GG-ANN models gave better results for Enterococci for all training, validation, and testing sets and for most of the training sets for E. coli.

The results also demonstrated the potential for Gamma-GA tests to identify variables that give a better model compared with stepwise MLR. Although SL-ANN models usually gave better MSE and R² for the testing results of E. coli, the GG-ANN model was better at identifying events of poor water quality. This illustrates the merit of the nonlinear variable identification approach: the variables identified are more capable of predicting high FIO concentrations. Therefore, the GG-ANN model is more suitable for bathing water warning applications in which predicting high FIO concentrations is the major concern. For the two variable identification approaches, ANN models were better than linear regression models in terms of MSE and R² as well as sensitivity. This result again highlighted the importance of including nonlinear effects in prediction models.

In conclusion, this paper demonstrated the potential of combining the Gamma-GA test and ANN to predict bathing water quality. Prior to the variable identification tests, a correlation analysis was conducted to remove redundant variables in the data set. The need for such an analysis illustrates the importance of understanding the data set in the development and application of data-driven models.

Data Availability Statement

Some data, models, or codes that support the findings of this study are available from the corresponding author upon reasonable request (MATLAB codes for Gamma-GA tests and the ANN models and model outputs). The bacteria and environmental data were collected by Aberystwyth University, Swansea City Council, and Natural Resources Wales (Environmental Protection and Regulatory Authority in Wales), and unfortunately, the authors have not been authorized to share the data set.

Acknowledgments

The study was carried out as a part of the EERES4WATER (Promoting Energy-Water nexus resource efficiency through renewable energy and energy efficiency) project, which is co-financed by the Interreg Atlantic Area Programme through the European Regional Development Fund under EAPA 1058/2018.

Notation

The following symbols are used in this paper:
A = slope for the gamma test regression equation;
f(·) = nonlinear smooth function relating x_i and y_i;
M = total number of time instants of the data;
r = random variable (noise);
X = linearly independent environmental variables;
x_i = linearly independent environmental variables at time instant i (i.e., the data point at i);
x_i' = imaginary environmental variables at time instant i;
x_j = jth environmental variable;
x_{j,max} = maximum value of variable x_j;
x_{j,min} = minimum value of variable x_j;
x_{j,nor} = normalized x_j;
x_{N[i,k]} = kth nearest data point to x_i;
y = target data (FIO concentrations);
y_{N[i,k]} = target data value associated with x_{N[i,k]};
γ = estimate of variability among x_i, 1 ≤ i ≤ M;
Γ = intercept for the gamma test regression; and
δ = estimate of variability among y_i, 1 ≤ i ≤ M.

References

Abu-Bakar, A., R. Ahmadian, and R. A. Falconer. 2017. “Modelling the transport and decay processes of microbial tracers in a macro-tidal estuary.” Water Res. 123 (4): 802–824. https://doi.org/10.1016/j.watres.2017.07.007.
Ahmadian, R., S. Bomminayuni, R. Falconer, and T. Stoesser. 2013. Numerical modelling of flow and faecal indicator organism transport at Swansea Bay, UK. Technical Rep. for the Interreg 4a Smart Coasts—Sustainable Communities Project. Cardiff, Wales: Cardiff Univ.
Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Bentley, J. 1975. “Multidimensional binary search trees used for associative search.” Commun. ACM 18 (9): 509–517. https://doi.org/10.1145/361002.361007.
Boehm, A. B., S. B. Grant, J. H. Kim, S. L. Mowbray, C. D. McGee, C. D. Clark, D. M. Foley, and D. E. Wellman. 2002. “Decadal and shorter period variability of surf zone water quality at Huntington Beach, California.” Environ. Sci. Technol. 36 (18): 3885–3892. https://doi.org/10.1021/es020524u.
Burton, A., V. Glenis, M. R. Jones, and C. G. Kilsby. 2013. “Models of daily rainfall cross-correlation for the United Kingdom.” Environ. Modell. Software 49 (Feb): 22–33. https://doi.org/10.1016/j.envsoft.2013.06.001.
Choubin, B., and A. Malekian. 2017. “Combined gamma and m-test-based ANN and ARIMA models for groundwater fluctuation forecasting in semiarid regions.” Environ. Earth Sci. 76 (7): 538. https://doi.org/10.1007/s12665-017-6870-8.



Crowther, J., D. Kay, and M. D. Wyer. 2001. “Relationships between microbial water quality and environmental conditions in coastal recreational waters: The Fylde Coast, UK.” Water Res. 35 (17): 4029–4038. https://doi.org/10.1016/S0043-1354(01)00123-3.
Deb, K. 2000. “An efficient constraint handling method for genetic algorithms.” Comput. Methods Appl. Mech. Eng. 186 (99): 311–338. https://doi.org/10.1016/S0045-7825(99)00389-8.
Deep, K., K. P. Singh, M. L. Kansal, and C. Mohan. 2009. “A real coded genetic algorithm for solving integer and mixed integer optimization problems.” Appl. Math. Comput. 212 (2): 505–518. https://doi.org/10.1016/j.amc.2009.02.044.
Dufour, A. P. 1984. “Bacterial indicators of recreational water quality.” Can. J. Public Health 75 (1): 49–56.
European Commission. 2006. “Directive 2006/7/EC of the European parliament and of the council of 15 February 2006 concerning the management of bathing water quality and repealing directive 76/160/EEC.” OJEU 64 (40): 37–51.
Evans, D., and A. J. Jones. 2002. “A proof of the gamma test.” Proc. R. Soc. London, Ser. A 458 (2): 2759–2799. https://doi.org/10.1098/rspa.2002.1010.
Evans, G. P., B. M. Mollowney, and N. C. Spoel. 1990. “Two-dimensional modelling of the Bristol Channel, UK.” In Proc., Conf. on Estuarine and Coastal Modelling, edited by M. L. Spaulding, 331–340. Reston, VA: ASCE.
Garrett, J. J. 1994. “Where and why artificial neural networks are applicable in civil engineering.” J. Comput. Civ. Eng. 8 (2): 129–130. https://doi.org/10.1061/(ASCE)0887-3801(1994)8:2(129).
Ghaderi, K., B. Motamedvaziri, M. Vafakhah, and A. A. Dehghani. 2019. “Regional flood frequency modeling: A comparative study among several data-driven models.” Arab. J. Geosci. 12 (18): 588. https://doi.org/10.1007/s12517-019-4756-7.
Gonzalez, R. A., K. E. Conn, J. R. Crosswell, and R. T. Noble. 2012. “Application of empirical predictive modeling using conventional and alternative fecal indicator bacteria in eastern North Carolina waters.” Water Res. 46 (18): 5871–5882. https://doi.org/10.1016/j.watres.2012.07.050.
Gonzalez, R. A., and R. T. Noble. 2014. “Comparisons of statistical models to predict fecal indicator bacteria concentrations enumerated by QPCR- and culture-based methods.” Water Res. 48 (Jan): 296–305. https://doi.org/10.1016/j.watres.2013.09.038.
He, L. M., and Z. L. He. 2008. “Water quality prediction of marine recreational beaches receiving watershed baseflow and stormwater runoff in southern California, USA.” Water Res. 42 (10–11): 2563–2573. https://doi.org/10.1016/j.watres.2008.01.002.
Huang, G., R. A. Falconer, and B. Lin. 2017. “Integrated hydro-bacterial modelling for predicting bathing water quality.” Estuarine Coastal Shelf Sci. 188 (Jan): 145–155. https://doi.org/10.1016/j.ecss.2017.01.018.
Iyer, M. S., and R. R. Rhinehart. 1999. “A method to determine the required number of neural-network training repetitions.” IEEE Trans. Neural Networks 10 (2): 427–432. https://doi.org/10.1109/72.750573.
Jin, G., and A. J. Englande. 2006. “Prediction of swimmability in a brackish water body.” Manage. Environ. Qual. 17 (2): 197–208. https://doi.org/10.1108/14777830610650500.
Jones, A. J. 2004. “New tools in non-linear modelling and prediction.” CMS 1 (Jun): 109–149. https://doi.org/10.1007/s10287-003-0006-1.
Kashefipour, S. M., B. Lin, and R. A. Falconer. 2005. “Neural networks for predicting seawater bacterial levels.” Water Manage. 158 (3): 111–118. https://doi.org/10.1680/wama.2005.158.3.111.
Keiner, L. E., and X. Yan. 1998. “A neural network model for estimating sea surface chlorophyll and sediments from thematic mapper imagery.” Remote Sens. Environ. 66 (98): 153–165. https://doi.org/10.1016/S0034-4257(98)00054-6.
Kim, J. H., S. B. Grant, C. D. McGee, B. F. Sanders, and J. L. Largier. 2004. “Locating sources of surf zone pollution: A mass budget analysis of fecal indicator bacteria at Huntington Beach, California.” Environ. Sci. Technol. 38 (9): 2626–2636. https://doi.org/10.1021/es034831r.
King, J., R. Ahmadian, and R. Falconer. 2021. “Hydro-epidemiological modelling of bacterial transport and decay in nearshore coastal waters.” Water Res. 196 (Nov): 117049. https://doi.org/10.1016/j.watres.2021.117049.
Lam, M. Y., and R. Ahmadian. 2022. “Numerical source-receptor connectivity study in nearshore coastal waters.” In Proc., 39th IAHR World Congress, edited by M. Ortega-Sánchez, 19–24. Madrid, Spain: International Association for Hydro-Environment Engineering and Research.
Lee, J., and B. Qu. 2004. “Hydrodynamic tracking of the massive spring 1998 red tide in Hong Kong.” J. Environ. Eng. 130 (5): 535–550. https://doi.org/10.1061/(ASCE)0733-9372(2004)130:5(535).
Lin, B., M. Syed, and R. A. Falconer. 2008. “Predicting faecal indicator levels in estuarine receiving waters—An integrated hydrodynamic and ANN modelling approach.” Environ. Modell. Software 23 (Sep): 729–740. https://doi.org/10.1016/j.envsoft.2007.09.009.
Masters, T. 1993. Practical neural network recipes in C++. San Diego: Academic Press.
Mathworks. 2020a. MATLAB deep learning toolbox user’s guide. Boston: Mathworks.
Mathworks. 2020b. MATLAB global optimization toolbox user’s guide. Boston: Mathworks.
Nevers, M. B., and R. L. Whitman. 2005. “Nowcast modeling of Escherichia coli concentrations at multiple urban beaches of southern Lake Michigan.” Water Res. 39 (5): 5250–5260. https://doi.org/10.1016/j.watres.2005.10.012.
Pandey, P. K., P. H. Kass, M. L. Soupir, S. Biswas, and V. Singh. 2014. “Contamination of water resources by pathogenic bacteria.” AMB Express 4 (51): 5250–5260. https://doi.org/10.1186/s13568-014-0051-x.
Pruss, A. 1998. “Review of epidemiological studies on health effects from exposure to recreational water.” Int. J. Epidemiol. 27 (4): 1–9. https://doi.org/10.1093/ije/27.1.1.
Russell, S. J., and P. Norvig. 2010. Artificial intelligence—A modern approach. 3rd ed. London: Pearson.
Schippmann, B., G. Schernewhki, and U. Grawe. 2013. “Escherichia coli pollution in a Baltic Sea lagoon: A model-based source and spatial risk assessment.” Int. J. Hyg. Environ. Health 216 (4): 408–420. https://doi.org/10.1016/j.ijheh.2012.12.012.
Stefansson, A., N. Koncar, and A. J. Jones. 1997. “A note on the gamma test.” Neural Comput. Appl. 5 (14): 131–133. https://doi.org/10.1007/BF01413858.
Stone, M. 1974. “Cross-validatory choice and assessment of statistical predictions.” J. R. Stat. Soc. Ser. B 36 (4): 117–147. https://doi.org/10.1111/j.2517-6161.1974.tb00994.x.
Thoe, W., M. Gold, A. Griesbach, M. Grimmer, M. L. Taggart, and A. B. Boehm. 2015. “Sunny with a chance of gastroenteritis: Predicting swimmer risk at California beaches.” Environ. Sci. Technol. 49 (4): 423–431. https://doi.org/10.1021/es504701j.
Thoe, W., and J. H. W. Lee. 2014. “Daily forecasting of Hong Kong beach water quality by multiple linear regression models.” J. Environ. Eng. 140 (2): 04013007. https://doi.org/10.1061/(ASCE)EE.1943-7870.0000800.
Uncles, R. J. 1981. “A numerical simulation of the vertical and horizontal M2 tide in the Bristol Channel and comparisons with observed data.” Limnol. Oceanogr. 26 (3): 571–577. https://doi.org/10.4319/lo.1981.26.3.0571.
USEPA. 2010. Predictive tools for beach notification, vol I: Review and technical protocol. Rep. No. EPA-823-R-10-003. Washington, DC: USEPA.
Whitman, R. L., M. B. Nevers, G. C. Korinek, and M. N. Byappanahalli. 2004. “Solar and temporal effects on Escherichia coli concentration at a Lake Michigan swimming beach.” Appl. Environ. Microbiol. 70 (7): 4276–4285. https://doi.org/10.1128/AEM.70.7.4276-4285.2004.
Wyer, M. D., et al. 2013a. Faecal indicator source connectivity for inputs to Swansea Bay, south Wales, UK. Technical Rep. for the Interreg 4a Smart Coasts—Sustainable Communities Project. Aberystwyth, Wales: Aberystwyth Univ.
Wyer, M. D., et al. 2013b. Statistical modelling of faecal indicator organisms at a marine bathing water site: Results of an intensive study at Swansea Bay, UK. Technical Rep. for the Interreg 4a Smart Coasts—Sustainable Communities Project. Aberystwyth, Wales: Aberystwyth Univ.
Wyer, M. D., D. Kay, H. Morgan, S. Naylor, S. Clark, J. Watkins, C. M. Davies, C. Francis, H. Osborn, and S. Bennett. 2018. “Within-day variability in microbial concentrations at a UK designated bathing water: Implications for regulatory monitoring and the application of predictive modelling based on historical compliance data.” Water Res. X 1 (Oct): 100006. https://doi.org/10.1016/j.wroa.2018.10.003.
Wyer, M. D., D. Kay, J. Watkins, C. Davies, C. Kay, R. Thomas, J. Porter, C. M. Stapleton, and H. Moore. 2010. “Evaluating short-term changes in recreational water quality during a hydrograph event using a combination of microbial tracers, environmental microbiology, microbial source tracking and hydrological techniques: A case study in southwest Wales, UK.” Water Res. 44 (16): 4783–4795. https://doi.org/10.1016/j.watres.2010.06.047.
Zhang, J., H. Qiu, X. Li, J. Niu, M. B. Nevers, X. Hu, and M. S. Phanikumar. 2018. “Real-time nowcasting of microbiological water quality at recreational beaches: A wavelet and artificial neural network based hybrid modelling approach.” Environ. Sci. Technol. 52 (15): 8446–8455. https://doi.org/10.1021/acs.est.8b01022.
Zhang, Z., Z. Deng, and K. A. Rusch. 2012. “Development of predictive models for determining Enterococci levels at Gulf Coast beaches.” Water Res. 46 (2): 465–474. https://doi.org/10.1016/j.watres.2011.11.027.