Lam Ahmadian 2022 Predicting Fecal Indicator Organisms in Coastal Waters Using a Complex Nonlinear Artificial
Abstract: High levels of fecal-indicator organisms (FIOs) at bathing water sites can cause disease and pose threats to public health. There is a
need for predicting FIO levels to inform the public and reduce exposure. Data-driven models are one of the main tools being considered as
predictive models. However, identifying the main inputs of the data-driven models is a major challenge in developing FIO predictor models. This
paper develops a data-driven model for FIO concentration prediction based on a limited number of critical input variables. Essential variables
were identified with a combination of the gamma test and Genetic Algorithm (Gamma-GA test). Artificial neural networks (ANNs) and linear
regression models were developed using these two variable identification approaches for comparison. The models were applied to a case study,
and it was found that the model using the Gamma-GA test has a high potential to predict FIO levels more accurately, although this requires
further investigation with different case studies. A correlation analysis was required prior to the variable identification approaches in this study.
The need for this analysis highlights the significance of understanding the waterbody and the data set in the development and application of data-driven
models. Models using a Gamma-GA test were more capable of predicting extreme (high) FIO concentrations, making a Gamma-GA test
more suitable for a bathing water quality early warning system. The importance of nonlinearity in such predictive models was also demonstrated
by the better performance of nonlinear ANN models compared with linear regression models regardless of the variable identification approaches
used. This paper highlights the importance of nonlinearity in bathing water quality prediction and encourages further utilization of nonlinear
models for this application. DOI: 10.1061/JOEEDU.EEENG-6986. © 2022 American Society of Civil Engineers.
Introduction
Waterborne pathogens in waterbodies cause illnesses such as gastrointestinal infections, eye infections, skin complaints, and nose and throat infections (Pruss 1998; Pandey et al. 2014). Fecal-indicator organisms (FIOs), e.g., E. coli and Enterococci, are commonly used to indicate the level of pathogens in waterbodies (Dufour 1984; Pandey et al. 2014). In Europe, the European Union (EU)’s revised Bathing Water Directive (rBWD) (European Commission 2006) requires member states to monitor at least the concentrations of two FIO species in designated bathing waters for compliance. The rBWD recognizes short-term occasional pollution and includes provisions for discounting compliance requirements when there is a predictive and warning system to alert the public of impending poor water quality.
Traditionally, FIO concentrations in bathing water samples are determined by culture-based methods. These methods require a minimum of an 18–24 h (USEPA 2010) laboratory assay. However, FIO concentrations change continuously (Boehm et al. 2002; Whitman et al. 2004; Kim et al. 2004; King et al. 2021), causing culture-based warning systems to give outdated water quality alerts. More rapid FIO analysis methods such as quantitative polymerase chain reaction (qPCR) can determine FIO concentrations in under 6 h, but these methods require significant up-front investments and trained personnel (Zhang et al. 2018) and still cannot be used as a predictive tool. Although field sampling and analysis are important, they do not provide predictions of impending bathing water quality.
Two- and three-dimensional hydroenvironmental models are commonly applied to assess FIO concentrations in waterbodies. These models numerically solve the mass and momentum equations of fluids as well as the fate and transport of FIOs, including decay and interaction with sediment. These models have been applied in a wide range of studies (e.g., Lee and Qu 2004; Lin et al. 2008; Schippmann et al. 2013; Huang et al. 2017; Abu-Bakar et al. 2017) to provide relatively accurate predictions of the spatial and temporal concentration distributions of FIOs. Nevertheless, these models require detailed knowledge of flow and FIOs at the boundary of the modeling domain, which are generally very expensive and time-consuming to acquire. Moreover, such models are usually computationally demanding and require a long run time even on modern computers. Therefore, using such models in real time as a part of early warning systems is not practical.
Data-driven models are promising alternatives in providing timely predictions of FIOs for bathing water quality warning systems due to their lower computational requirements. Such models utilize data obtained by environmental sensors to predict FIO concentrations in bathing waters. The public could then be warned about occasions with high FIO concentrations. The development of such data-driven models requires identifying FIO predictive variables and establishing relationships between these variables and FIO concentrations. These two steps have been previously conducted mainly by stepwise multilinear regression (MLR) (e.g., Crowther et al. 2001; Nevers and Whitman 2005; Gonzalez et al. 2012; Wyer et al. 2013b; Gonzalez and Noble 2014). In a stepwise MLR, variables are included or excluded in a linear regression equation in a stepwise manner. Such inclusion or exclusion is decided by the influence of the input variables on estimating the target variables through linear regression analysis. This linear approach does not account for possible nonlinear relationships that could affect predicting extremes.
1 Research Associate, School of Engineering, Cardiff Univ., Cardiff CF24 3AA, UK (corresponding author). ORCID: https://orcid.org/0000-0001-7259-968X. Email: [email protected]
2 Professor, School of Engineering, Cardiff Univ., Cardiff CF24 3AA, UK. ORCID: https://orcid.org/0000-0003-2665-4734
Note. This manuscript was submitted on April 13, 2022; approved on October 5, 2022; published online on December 6, 2022. Discussion period open until May 6, 2023; separate discussions must be submitted for individual papers. This paper is part of the Journal of Environmental Engineering, © ASCE, ISSN 0733-9372.
$$\Gamma = \frac{1}{2M}\sum_{i=1}^{M}(r - r_1)^2 = \mathrm{Var}(r) \qquad (11)$$
three realizations in the M-test, the data were grouped into the three sets according to the data order (i.e., the first to nth data were put into the training set; the (n + 1)th to mth data were put into the validation set; the (m + 1)th to the end of the data were put into the testing set; n < m). The performance function used in this study was the mean squared error (MSE) between model outputs and target data. Training of a network was stopped when no further improvement in MSE for the validation data set could be achieved after six iterations. Although this method avoids parameter (weights and bias) overtraining, it does not avoid overtraining due to overcomplicated network architecture and redundant predictive variables. This fact highlights the importance of the Gamma-GA test in rejecting redundant variables. For each network, 300 training runs with random initial weights were conducted, and the network that provided the minimum MSE was chosen. Iyer and Rhinehart (1999) showed that the network obtained from this approach has a 95% confidence level that its MSE is within the lowest 1.0%.
Linear regression models with their predictive variables selected by the Gamma-GA tests (GG-Linear models) and stepwise MLR models (SL-Linear models) were also developed to assess ANN model performance.
Model Application
The model was applied to Swansea Bay, located on the north of the Bristol Channel in the southwest of the UK, as shown in Fig. 2(a). Along the bay are two sandy beaches with bathing water status: the Swansea Beach and Aberavon Beach. Potential sources of FIO in the Bay are the discharges from rivers, streams, surface water drains, three offshore outfalls from wastewater treatment works, and transport by currents from sources outside of the Bay. Large amounts of data were collected as a part of the previous Smart Coast Sustainable Communities (SCSC) research project (Wyer et al. 2013b, 2018), which, alongside the variety of the sources, make the Bay an ideal case study for data-driven modeling. The stream and drain discharges are generally low (<1 m³/s); Rivers Tawe, Clyne, Neath, and Afan have relatively high flow rates (>5 m³/s). The water is well-mixed in the Severn Estuary and Bristol Channel (Uncles 1981; Evans et al. 1990; Ahmadian et al. 2013). FIO concentrations in the beaches are governed by the sources and the hydrodynamics in the Bay (Ahmadian et al. 2013).
The concentrations of two FIO species, namely E. coli and Enterococci, were sampled at various sources and receptors at high frequency, i.e., intervals of 15–30 min, in years 2011 and 2012. In Swansea Beach, the large tidal range (exceeding 10 m) and sloping beach result in a tidal flat exposed up to 1,500 m from shore during high spring tides. The large extent of the tidal flats makes single-point FIO concentration measurement impossible. In the data collection scheme, FIO concentrations were measured along a sampling transect consisting of designated sampling points (DSPs) in Swansea Bay, as shown in Fig. 2(a). Fig. 2(b) shows the DSPs in the sampling transect in the 2011 bathing season. The samples were collected in sterile 1-L containers (Aurora Scientific, Bristol, UK) and stored in a
ces Wales (NRW), the official natural resource management organization in Wales. The water depths and velocities at the five smaller streams were measured by OTT Orpheus Mini pressure transducers (OTT HydroMet, Kempten, Germany) and Sensa RC2 electromagnetic velocity meters (Aqua Data, Oxfordshire, UK), respectively. Global radiation, temperature, relative humidity, rainfall, and wind speed were measured at the meteorological station. Global radiation was measured by a SKS 1110 pyranometer (Skye Instruments, Llandrindod Wells, UK); air temperature and relative humidity were measured by a HygroClip2 Hc2-S3 sensor (Rotronic, Crawley, UK); rainfall was measured with a 370C tipping-bucket rain gauge (20.3 cm aperture and 0.2 mm tip; Met One Instruments, Grants Pass, Oregon); wind speed and direction were measured with a WindSonic anemometer (Gill Instruments, Hampshire, UK). Offshore wastewater discharge volumes were also measured in the SCSC project but were not included as potential model inputs because the tracer study conducted as a part of the SCSC project (Ahmadian et al. 2013) and the two-dimensional TELEMAC hydrodynamic simulation conducted by the authors suggested that they are not important for FIO concentrations at the DSPs compared with other FIO sources (not shown).
Tables 1 and 2 summarize the data set used in this paper. The target variables were Enterococci and E. coli concentrations measured at the Bathing Water DSPs during a bathing season, namely June 22 to September 28, 2011. The input data set included 16 environmental variables measured in the same bathing season as indicated in Table 2. The ranges of values of different input and target variables were significantly different, as detailed in Tables 1 and 2, due to the large number of factors that affect bacteria concentrations. In order to ensure consistency between data and reduce the impact of variation ranges on the model, all the data have been normalized to the range of 0–1 using the following equation:
$$x_{j,\mathrm{nor}} = \frac{x_j - x_{j,\min}}{x_{j,\max} - x_{j,\min}} \qquad (13)$$
where x_j and x_{j,nor} = unnormalized and normalized jth variable; and x_{j,max} and x_{j,min} = maximum and minimum of the time series x_j before data processing and model training.
Logarithmic transformation was applied to the variables that have relatively high skewness to transform them from a lognormal-like distribution to a normal-like distribution. This transformation is necessary because stepwise MLR models assume normally distributed data (skewness = 0). If the data have a significant skewness, the stepwise MLR variable inclusion/exclusion procedures may not be suitable, and subsequently the model would not result in good validation. Collinearities among variables were identified, and redundant variables were removed. In order to build memory of the past conditions, e.g., solar radiation or rain prior to the simulation, and time required for transport of bacteria across the bay, which could significantly affect the concentration of bacteria, time-lagged variables were also considered as the input variables in the data-driven models. Correlation analysis was conducted to identify collinear variables and determine a lag time that did not cause additional collinearity issues.
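The normalization of Eq. (13) and the logarithmic transformation described above can be illustrated with a minimal sketch. The function names, the sample concentrations, and the order of operations (natural logarithm first, then scaling to 0–1) are assumptions made for illustration and are not taken from the authors' MATLAB code.

```python
import math


def min_max_normalize(series):
    """Scale a series to the range 0-1, following Eq. (13)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be normalized")
    return [(x - lo) / (hi - lo) for x in series]


def log_transform(series):
    """Natural-log transform for strictly positive, positively skewed data."""
    return [math.log(x) for x in series]


# Hypothetical E. coli concentrations in cfu/100 mL (illustrative values only).
ecoli = [3.0, 20.0, 150.0, 500.0, 3100.0]

# Assumed order: reduce skewness first, then scale to 0-1.
ecoli_ln = log_transform(ecoli)
ecoli_nor = min_max_normalize(ecoli_ln)

for raw, scaled in zip(ecoli, ecoli_nor):
    print(f"{raw:8.1f} cfu/100 mL -> {scaled:.3f}")
```

Because the minimum and maximum are taken from the whole time series, a value outside that range in later data would map outside 0–1; how such cases were handled is not stated in the text.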
Fig. 2. (a) Site layout and key sampling locations in Swansea Bay, UK (sources: Esri, DigitalGlobe, GeoEye, i-cubed, USDA FSA, USGS, AEX,
Getmapping, Aerogrid, IGN, IGP, swisstopo, and the GIS User Community); and (b) close-up of DSPs (circles) in the sampling transect in plot (a) in
the 2011 bathing season at 30-min intervals from 07:00 to 16:00 (image © Google, Image © 2022 TerraMetrics).
Table 1. Measured FIO concentrations during June 22–September 28, 2011
FIO variables               Ln transformation   Range after transformation
                                                Minimum   Maximum
E. coli (cfu/100 mL)        Yes                 1.10      8.04
Enterococci (cfu/100 mL)    Yes                 1.10      8.37
Note: Ln = natural logarithm.
Results and Discussion
Input Variable Selection
From the correlation analysis, 23 input variables were identified as shown in the Variables Identified from the Correlation Analysis column in Table 3. Only one single representative streamflow, the flow of the Tawe River, was selected because flows at different streams with no time lag were found highly correlated. Such a high correlation could be explained by the small size of catchment associated with each stream, which means all streams are influenced by similar weather, and particularly rainfall, patterns. Burton et al. (2013) reported that the spatial correlation of rainfall at the site remains higher than 0.5 for two points that are 100 km apart, implying that the rainfall is correlated within a 100 × 100 = 10,000-km² area. This area is larger than the sum of the watersheds of three major rivers discharging to the Bay (506.4 km²), namely the watershed for River Tawe (227.7 km²), River Neath (190.9 km²), and River Afan (87.8 km²). Turbidity and salinity were also found highly correlated with the streamflow and thus eliminated from the data set. This is consistent with the idea of Thoe and Lee (2014) that salinity reflects the mixing between riverine freshwater and the ambient seawater. As expected, the correlation analysis also showed high correlations between a time-lagged variable and the same variable with no time lag if the time lag was not sufficiently long. The correlation between the no-time-lag and time-lagged streamflow remained high (>0.6) for a lag time of 0.25–36 h; only the streamflow with 10-h lag was selected following hydrodynamic model results (Lam and Ahmadian 2022).
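The check between a variable and its own time-lagged copy can be sketched as follows. This is an illustration only: the function names, the synthetic series, the 0.25-h sampling interval, and the reuse of 0.6 as a screening threshold are assumptions, and the source does not state which correlation coefficient was used (Pearson is assumed here).

```python
import math


def pearson(a, b):
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def lag_correlation(series, lag_steps):
    """Correlation between a series and the same series delayed by lag_steps."""
    if not 0 < lag_steps < len(series) - 1:
        raise ValueError("lag_steps must be between 1 and len(series) - 2")
    return pearson(series[lag_steps:], series[:-lag_steps])


def is_collinear_with_unlagged(series, lag_steps, threshold=0.6):
    """True if the lagged copy is still highly correlated with the original."""
    return abs(lag_correlation(series, lag_steps)) > threshold


# Synthetic, slowly varying signal sampled every 0.25 h (assumed interval)
# with a 24-h period; it stands in for a measured variable.
flow = [math.sin(2.0 * math.pi * i / 96.0) for i in range(960)]

for steps in (1, 8, 24):
    r = lag_correlation(flow, steps)
    print(f"lag {steps * 0.25:5.2f} h: r = {r:+.3f}")
```

A lagged variable that fails this screen adds little independent information to its unlagged counterpart, which is why only sufficiently long lags are retained as candidate inputs.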
greater than 0.25 h; this is because rainfall was not expected to have an immediate effect on FIO concentrations from a physical process point of view.
Table 3 provides the predictive variables identified by the Gamma-GA tests and stepwise MLRs. For consistency of comparison between the methods and to prevent overparameterization, both Gamma-GA tests and stepwise MLRs were constrained to choose a maximum of eight variables. Ideally, an interpretability analysis of the variables identified by Gamma-GA tests and stepwise MLRs is desirable to assess the performance of Gamma-GA tests, but such
tant variable in other data-driven models for other nearshore coastal waters (Crowther et al. 2001; Nevers and Whitman 2005; He and He 2008; Zhang et al. 2012). Wind was also shown to be important for FIO concentration by both predictive variable identification methods. This is consistent with the stepwise MLR results of Wyer et al. (2013b). Streamflow, as a known FIO source (e.g., Wyer et al. 2010, 2013a; Lam and Ahmadian 2022), was included only by the Gamma-GA model for Enterococci and stepwise MLR for E. coli, but the variable was not included for other tests. This is attributed to the small spatial and temporal scale (in a watershed of about
Fig. 4. MSEs for the (a) training; (b) validation; and (c) testing sets of GG-ANN and SL-ANN models versus number of hidden layer nodes for Enterococci, Realization 3.
Fig. 5. MSEs for the (a) training; (b) validation; and (c) testing sets of GG-ANN and SL-ANN models versus number of hidden layer nodes for E. coli, Realization 1.
M-testing was conducted for variables identified by the Gamma-GA tests and stepwise MLRs to determine the data length needed for model training. Fig. 3 shows that the Gamma-GA tests selected variables that achieve lower (i.e., better) |Γ| compared with the stepwise MLRs given a sufficiently long data series (e.g., beyond 500 data points), which means that the data length for model training should be greater than 200 for Gamma-GA test to give better results compared with stepwise MLR. Following the M-test, the ratio of data points in training, validation, and testing sets was 0.6:0.2:0.2, giving 949 × 0.6 = 571 data points in the training set, which satisfies the minimum of 200 data point requirement imposed by the M-test. The mean and standard deviation values of the training, validation, and testing data sets were checked to be approximately comparable in all three realizations.
models gave similar MSEs. The fact that GG-ANN and SL-ANN models gave similar MSEs does not conflict with the M-test results. Although the M-test results suggested that the Gamma-GA tests identified variables that had the potential to achieve lower MSEs, the M-test does not specify the nonlinear model that gives such results. It is possible that nonlinear models other than ANN give better results with the Gamma-GA identified variables; however, this is out of the scope of this study. For further comparison between models developed from the Gamma-GA tests and stepwise MLRs, networks with 1 to 50 nodes in the hidden layer were tested, and the number of nodes in the hidden layer was selected based on providing the lowest validation MSE.
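The 0.6:0.2:0.2 split and the choice of hidden-layer size by lowest validation MSE can be sketched as below. The rounding rule (validation and testing sizes rounded down, with the remainder assigned to training) is an assumption chosen because it reproduces the 571 training points quoted in the text; the function names and the validation MSE values are hypothetical.

```python
def chronological_split_sizes(n_total, val_fraction=0.2, test_fraction=0.2):
    """Sizes of the training, validation, and testing sets.

    Assumption: validation and testing sizes are rounded down and the
    remainder goes to training.
    """
    n_val = int(n_total * val_fraction)
    n_test = int(n_total * test_fraction)
    return n_total - n_val - n_test, n_val, n_test


def chronological_split(data, val_fraction=0.2, test_fraction=0.2):
    """Split data in its original order: training, then validation, then testing."""
    n_train, n_val, _ = chronological_split_sizes(
        len(data), val_fraction, test_fraction
    )
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]


def select_hidden_nodes(validation_mse_by_nodes):
    """Return the hidden-layer node count with the lowest validation MSE."""
    return min(validation_mse_by_nodes, key=validation_mse_by_nodes.get)


# 949 data points, as in the study; the contents are placeholders.
records = list(range(949))
train, val, test = chronological_split(records)
print(len(train), len(val), len(test))

# Hypothetical validation MSEs for a few candidate node counts (not study results).
candidate_mse = {1: 0.030, 10: 0.020, 25: 0.015, 50: 0.018}
print("selected nodes:", select_hidden_nodes(candidate_mse))
```

In the study all node counts from 1 to 50 were evaluated; the dictionary above lists only four candidates to keep the example short.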
Fig. 6. Regressions between target Enterococci concentrations and GG-ANN model outputs for (a) training; (b) validation; (c) testing; and (d) all data
sets, Realization 3. LCC = linear correlation coefficient.
Fig. 7. Regressions between target E. coli concentrations and GG-ANN model outputs for (a) training; (b) validation; (c) testing; and (d) all data sets,
Realization 1.
Mean Squared Error and R²
Figs. 6 and 7 show the comparison between GG-ANN model results and target FIO concentrations for training, validation, and test data sets, as well as all data. Tables 4 and 5 present the comparison among GG-ANN, SL-ANN, GG-Linear, and SL-Linear models. For most ANN models, the optimal SL-ANN models consisted of fewer hidden layer nodes compared with GG-ANN models, which is consistent with the “Selection of Number of Nodes” section.
GG-ANN and SL-ANN models gave better MSE and R² than GG-Linear and SL-Linear models. This shows the capacity of nonlinear models in capturing inherent nonlinear relationships between variables and FIO concentrations. GG-ANN models gave better training, validation, and testing results than SL-ANN models for Enterococci, but SL-ANN models gave better validation and testing results for E. coli. The better GG-ANN performance for Enterococci can be explained by the fact that GG-ANN better captures extreme FIO concentrations, as illustrated in the “Performance Tables” section. This GG-ANN property helps the models perform better for Enterococci because there are more extreme values for the data series of Enterococci (17.8% of the data was below 0.1 or above 0.9) compared with E. coli (8.9% of the data was below 0.1 or above 0.9). The MSE of SL-Linear models was better than the one of the GG-Linear models, verifying the fact that stepwise MLR chooses variables that optimize linear model performance compared with the Gamma-GA test.
Performance Tables
The ability to identify the most hazardous circumstances, namely poor water quality conditions, is particularly important when a real-time predictive model is used as an early warning system. The EU rBWD (European Commission 2006) considers the water quality in a bathing site to be poor if the 90-percentile FIO concentration in the lognormal distribution obtained from the last assessment period (usually the last four bathing seasons) exceeds a given threshold. The threshold is 185 colony-forming units per 100 mL (cfu/100 mL) for Enterococci and 500 cfu/100 mL for E. coli. In this study, the use of 90-percentile values was not sensible because water quality is being predicted at a 30-min interval. To test the models’ ability to identify poor water quality events, individual Enterococci and E. coli concentration values were compared with the 185 and 500 cfu/100 mL thresholds, respectively. Fig. 8 shows the performance tables of the data-driven models in correctly predicting poor water quality under the EU rBWD classification for the testing sets. In this context, sensitivity and specificity are defined as follows:
$$\mathrm{Sensitivity} = \frac{\text{Correctly predicted poor water quality}}{\text{Observed poor water quality}} \qquad (14)$$
Table 5. MSE and unadjusted R² between computed and measured E. coli concentrations
Columns: Model | Hidden layer node number | MSE (Training, Validation, Testing) | R² (Training, Validation, Testing)
Realization 1
GG-ANN 25 0.0067 0.0159 0.0214 0.8396 0.6077 0.5312
SL-ANN 18 0.0096 0.0154 0.0174 0.7705 0.6198 0.6177
GG-linear N/A 0.0325 0.0342 0.2166 0.2482
SL-linear N/A 0.0297 0.0305 0.2838 0.3290
Realization 2
GG-ANN 13 0.0119 0.0178 0.0205 0.7208 0.5939 0.4801
SL-ANN 18 0.0113 0.0135 0.0188 0.7370 0.6914 0.5221
GG-linear N/A 0.0331 0.0315 0.2294 0.1996
SL-linear N/A 0.0299 0.0295 0.3041 0.2484
Realization 3
GG-ANN 25 0.0073 0.0146 0.0196 0.8066 0.6747 0.6337
SL-ANN 27 0.0091 0.0155 0.0186 0.7582 0.6539 0.6501
GG-linear N/A 0.0321 0.0363 0.1873 0.3181
SL-linear N/A 0.0297 0.0310 0.2501 0.4167
Note: Bold indicates the best results for each realization.
$$\mathrm{Specificity} = \frac{\text{Correctly predicted not poor water quality}}{\text{Observed not poor water quality}} \qquad (15)$$
To explain, sensitivity represents the likelihood that a poor-water-quality event is correctly predicted. Specificity represents the likelihood that a not-poor-water-quality event is correctly predicted. It can be alternatively interpreted as one minus the false alarm rate.
Being consistent with the result that the ANN models gave better MSE and R² values, the ANN models gave significantly higher sensitivity than the linear regression models: 24%–62% and 0%–14%, respectively. The observation that nonlinear models give more accurate FIO predictions is consistent with Thoe et al. (2015). The results are also consistent with those of Zhang et al. (2012) that ANNs capture extreme FIO values better than linear regression models. Sensitivities of GG-ANN models were higher than the ones for SL-ANN models for both Enterococci and E. coli, despite the MSEs for SL-ANN models being lower than the ones for GG-ANN models for E. coli. This suggests that GG-ANN models better capture high FIO concentrations compared with SL-ANN models. The sensitivity of SL-Linear models was better than the one of GG-Linear models as expected from the MSE and R² results. Specificities for all the models tested were always higher than 90%. This is probably due to the small proportion of poor-water-quality events during the study period.
Sensitivity was higher for Enterococci concentrations than for E. coli in Fig. 8 for GG-ANN models. The same was observed in the performance tables for the entire data set (949 data points). An explanation can be given from the probability distribution of the Enterococci and E. coli data. A chi-square test showed that both data follow lognormal distributions with a confidence level of above 95% if very small values (<3 cfu/100 mL) were removed. From the respective probability distribution functions, the exceedance probability of the Enterococci threshold (15.9%) was higher than the one of E. coli (5.5%) from the entire data set. With a lower exceedance probability, models that better capture extreme values are required to achieve a better sensitivity for E. coli compared with Enterococci concentrations.
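The sensitivity and specificity of Eqs. (14) and (15) can be computed from paired observed and predicted concentrations as in the following minimal sketch. The thresholds are those quoted in the text; the function names and the sample concentrations are hypothetical, and treating "exceeds" as a strict inequality is an assumption.

```python
THRESHOLDS_CFU_PER_100ML = {"Enterococci": 185.0, "E. coli": 500.0}


def performance_table(observed, predicted, threshold):
    """Count the four outcomes of a poor / not-poor classification."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for obs, pred in zip(observed, predicted):
        obs_poor = obs > threshold      # assumed strict exceedance
        pred_poor = pred > threshold
        if obs_poor and pred_poor:
            counts["tp"] += 1           # poor event correctly predicted
        elif obs_poor:
            counts["fn"] += 1           # poor event missed
        elif pred_poor:
            counts["fp"] += 1           # false alarm
        else:
            counts["tn"] += 1           # not-poor event correctly predicted
    return counts


def sensitivity(table):
    """Eq. (14): correctly predicted poor events / observed poor events."""
    observed_poor = table["tp"] + table["fn"]
    return table["tp"] / observed_poor if observed_poor else float("nan")


def specificity(table):
    """Eq. (15): correctly predicted not-poor events / observed not-poor events."""
    observed_not_poor = table["tn"] + table["fp"]
    return table["tn"] / observed_not_poor if observed_not_poor else float("nan")


# Hypothetical Enterococci concentrations in cfu/100 mL (illustrative only).
observed = [50.0, 200.0, 600.0, 120.0, 190.0, 30.0]
predicted = [60.0, 150.0, 400.0, 100.0, 210.0, 250.0]

table = performance_table(observed, predicted, THRESHOLDS_CFU_PER_100ML["Enterococci"])
print(table)
print(f"sensitivity = {sensitivity(table):.2f}, specificity = {specificity(table):.2f}")
```

When poor-water-quality events are rare, specificity is dominated by the many not-poor events and tends to be high for any model, whereas sensitivity depends on only a few events.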
Fig. 8 panel contents (recoverable rows only; each row lists predicted "Not poor" against observed Not poor, observed Poor, and the row percentage):
(a) Not poor: 146, 11, 93%; (b) Not poor: 170, 6, 97%
(c) Not poor: 155, 22, 88%; (d) Not poor: 172, 8, 96%
Fig. 8. Performance tables for data-driven models: (a and b) GG-ANN models; (c and d) SL-ANN models; (e and f) GG-Linear models; and
(g and h) SL-Linear models. Plots (a, c, e, and g) show Enterococci Realization 1 testing sets, and plots (b, d, f, and h) show E. coli Realization
3 testing sets.
Discussion
Gamma testing is a promising tool to identify input data for a data-driven model because it is nonlinear, and it does not require a regression equation a priori. Nevertheless, these advantages do not imply that gamma tests can be applied with no knowledge about the waterbody or the data to which the test is applied. In this paper, a correlation analysis was conducted prior to the Gamma-GA tests to remove highly correlated candidate predictive variables.
The GG-ANN and SL-ANN models fitted better to the measured FIO concentrations and captured better the extreme FIO concentrations compared with GG-Linear and SL-Linear models. This is consistent with Zhang et al. (2012) for FIO concentrations and Keiner and Yan (1998) for chlorophyll-a and suspended sediments. Whereas Zhang et al. (2012) arrived at this conclusion by comparing 15-variable ANN models to five- or six-variable linear regression models, this study confirms their findings with the same number of explanatory variables used in the ANN and linear models. This demonstrates the importance of including nonlinearity in capturing high FIO concentrations. The effect of nonlinearity of ANN was also reflected in higher sensitivities of GG-ANN and SL-ANN models compared with GG-Linear and SL-Linear models.
Comparing GG-ANN and SL-ANN models, GG-ANN models gave better results for Enterococci for all training, validation, and testing sets and most of the training sets for E. coli. Although SL-ANN models gave better testing results for E. coli, GG-ANN models gave higher sensitivities for both Enterococci and E. coli, showing that Gamma-GA models select variables that capture better extreme FIO concentrations compared with the stepwise MLR. The results suggest that GG-ANN models are more suitable for bathing water quality warning applications in which predicting high FIO concentrations is the major concern.
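To make concrete why the gamma test needs no regression equation a priori, the following sketch follows the standard formulation of Evans and Jones (2002): mean squared distances to the kth nearest neighbors in input space are regressed against half the mean squared differences of the associated targets, and the intercept Γ estimates the noise variance. The function name, the default of 10 neighbors, the brute-force neighbor search, and the synthetic data are assumptions for illustration; this is not the authors' MATLAB implementation and omits the genetic algorithm that selects the input subset.

```python
import math
import random


def gamma_test(inputs, targets, n_neighbors=10):
    """Return (Gamma, A): intercept and slope of the gamma-test regression.

    inputs  : list of equal-length numeric tuples, one per time instant
    targets : list of floats, one per time instant
    """
    m = len(inputs)
    if m != len(targets):
        raise ValueError("inputs and targets must have the same length")
    if not 2 <= n_neighbors < m:
        raise ValueError("need 2 <= n_neighbors < number of data points")

    input_spread = [0.0] * n_neighbors   # mean squared distance to kth neighbor
    output_spread = [0.0] * n_neighbors  # half mean squared target difference

    for i in range(m):
        # Brute-force search (assumed here for brevity): squared distance
        # from point i to every other point, sorted ascending.
        neighbors = sorted(
            (sum((a - b) ** 2 for a, b in zip(inputs[i], inputs[j])), j)
            for j in range(m) if j != i
        )
        for k in range(n_neighbors):
            dist_sq, j = neighbors[k]
            input_spread[k] += dist_sq / m
            output_spread[k] += (targets[j] - targets[i]) ** 2 / (2 * m)

    # Ordinary least-squares line: output_spread = Gamma + A * input_spread.
    mean_in = sum(input_spread) / n_neighbors
    mean_out = sum(output_spread) / n_neighbors
    sxx = sum((d - mean_in) ** 2 for d in input_spread)
    sxy = sum((d - mean_in) * (g - mean_out)
              for d, g in zip(input_spread, output_spread))
    slope = sxy / sxx
    return mean_out - slope * mean_in, slope


# Synthetic example: a smooth function plus noise of known variance (0.1 ** 2).
random.seed(1)
xs = [(random.uniform(0.0, 2.0 * math.pi),) for _ in range(300)]
ys = [math.sin(x[0]) + random.gauss(0.0, 0.1) for x in xs]
gamma_intercept, slope_a = gamma_test(xs, ys)
print(f"Gamma = {gamma_intercept:.4f}, A = {slope_a:.4f} (noise variance used: 0.0100)")
```

A candidate input subset with a smaller |Γ| leaves less unexplained variance for any smooth model, which is the quantity a search over input subsets would minimize. For longer series, a faster neighbor search such as a k-d tree would normally replace the brute-force loop.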
2018). The models used in this paper generally gave lower R² than the daily predictions of water quality in the literature. This highlights the difficulty of short-term water quality prediction; further study of the effect of time-scale on prediction accuracy is needed. Although Zhang et al. (2018) attempted to predict water quality with ANN at different time-scales, their results were not conclusive. Nevertheless, this paper highlights the potential for a combination of Gamma-GA test and a nonlinear predictive model to give timely bathing water quality prediction. Such a combination can be used in early warning systems for impending poor water quality, together with real-time environmental sensors.
Conclusion
This paper developed a data-driven model for FIO concentration prediction with only a limited number of critical input variables and without unnecessary ones. The performance of the Gamma-GA test as a tool for predictive variable identification of Enterococci and E. coli at an interval of 30 min was evaluated. ANN and linear regression models were developed from the variables identified from Gamma-GA tests and stepwise MLR for comparison. The GG-ANN models gave better results for Enterococci for all training, validation, and testing sets and most of the training sets for E. coli.
The results also demonstrated the potential for Gamma-GA tests to identify variables that give a better model compared with stepwise MLR. Although SL-ANN models usually gave better MSE and R² for testing results of E. coli, the GG-ANN model was better in identifying events of poor water quality. This illustrates the merit of the nonlinear variable identification approach: the variables identified are more capable of predicting high FIO concentrations. Therefore, the GG-ANN model is more suitable for bathing water warning applications in which predicting high FIO concentrations is the major concern. For the two variable identification approaches, ANN models were better than linear regression models in terms of MSE and R² as well as sensitivity. This result again highlighted the importance of including nonlinear effects in prediction models.
In conclusion, this paper demonstrated the potential of combining the Gamma-GA test and ANN to predict bathing water quality. Prior to the variable identification tests, a correlation analysis was conducted to remove redundant variables in the data set. The need for such an analysis illustrates the importance of understanding the data set in the development and application of data-driven models.
Data Availability Statement
Some data, models, or codes that support the findings of this study are available from the corresponding author upon reasonable request (MATLAB codes for Gamma-GA tests and the ANN models and model outputs). The bacteria and environmental data were collected by Aberystwyth University, Swansea City Council, and Natural Resources Wales (Environmental Protection and Regulatory
Development Fund under EAPA 1058/2018.
Notation
The following symbols are used in this paper:
A = slope for the gamma test regression equation;
f(·) = nonlinear smooth function relating x_i and y_i;
M = total number of time instants of the data;
r = random variable (noise);
X = linearly independent environmental variables;
x_i = linearly independent environmental variables at time instant i (i.e., the data point at i);
x_i′ = imaginary environmental variables at time instant i;
x_j = jth environmental variable;
x_{j,max} = maximum value of variable x_j;
x_{j,min} = minimum value of variable x_j;
x_{j,nor} = normalized x_j;
x_{N[i,k]} = kth nearest data point to x_i;
y = target data (FIO concentrations);
y_{N[i,k]} = target data value associated with x_{N[i,k]};
γ = estimate of variability among x_i, 1 ≤ i ≤ M;
Γ = intercept for the gamma test regression; and
δ = estimate of variability among y_i, 1 ≤ i ≤ M.
References
Abu-Bakar, A., R. Ahmadian, and R. A. Falconer. 2017. “Modelling the transport and decay processes of microbial tracers in a macro-tidal estuary.” Water Res. 123 (4): 802–824. https://doi.org/10.1016/j.watres.2017.07.007.
Ahmadian, R., S. Bomminayuni, R. Falconer, and T. Stoesser. 2013. Numerical modelling of flow and faecal indicator organism transport at Swansea Bay, UK. Technical Rep. for the Interreg 4a Smart Coasts—Sustainable Communities Project. Cardiff, Wales: Cardiff Univ.
Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Bentley, J. 1975. “Multidimensional binary search trees used for associative search.” Commun. ACM 18 (9): 509–517. https://doi.org/10.1145/361002.361007.
Boehm, A. B., S. B. Grant, J. H. Kim, S. L. Mowbray, C. D. McGee, C. D. Clark, D. M. Foley, and D. E. Wellman. 2002. “Decadal and shorter period variability of surf zone water quality at Huntington Beach, California.” Environ. Sci. Technol. 36 (18): 3885–3892. https://doi.org/10.1021/es020524u.
Burton, A., V. Glenis, M. R. Jones, and C. G. Kilsby. 2013. “Models of daily rainfall cross-correlation for the United Kingdom.” Environ. Modell. Software 49 (Feb): 22–33. https://doi.org/10.1016/j.envsoft.2013.06.001.
Choubin, B., and A. Malekian. 2017. “Combined gamma and m-test-based ANN and ARIMA models for groundwater fluctuation forecasting in semiarid regions.” Environ. Earth Sci. 76 (7): 538. https://doi.org/10.1007/s12665-017-6870-8.
.1016/j.amc.2009.02.044.
Dufour, A. P. 1984. “Bacterial indicators of recreational water quality.” 729–740. https://doi.org/10.1016/j.envsoft.2007.09.009.
Can. J. Public Health 75 (1): 49–56. Masters, T. 1993. Practical neural network recipes in C++. San Diego:
European Commission. 2006. “Directive 2006/7/EC of the European Academic Press.
parliament and of the council of 15 February 2006 concerning the man- Mathworks. 2020a. MATLAB deep learning toolbox user’s guide. Boston:
agement of bathing water quality and repealing directive 76/160/EEC.” Mathworks.
OJEU 64 (40): 37–51. Mathworks. 2020b. MATLAB global optimization toolbox user’s guide.
Evans, D., and A. J. Jones. 2002. “A proof of the gamma test.” Proc. R. Soc. Boston: Mathworks.
London, Ser. A 458 (2): 2759–2799. https://doi.org/10.1098/rspa.2002 Nevers, M. B., and R. L. Whitman. 2005. “Nowcast modeling of Escher-
.1010. ichia coli concentrations at multiple urban beaches of southern Lake
Evans, G. P., B. M. Mollowney, and N. C. Spoel. 1990. “Two-dimensional Michigan.” Water Res. 39 (5): 5250–5260. https://doi.org/10.1016/j
modelling of the Bristol Channel, UK.” In Proc., Conf. on Estuarine .watres.2005.10.012.
and Coastal Modelling, edited by M. L. Spaulding, 331–340. Reston, Pandey, P. K., P. H. Kass, M. L. Soupir, S. Biswas, and V. Singh. 2014.
VA: ASCE. “Contamination of water resources by pathogenic bacteria.” AMB
Garrett, J. J. 1994. “Where and why artificial neural networks are applicable Express 4 (51): 5250–5260. https://doi.org/10.1186/s13568-014
in civil engineering.” J. Comput. Civ. Eng. 8 (2): 129–130. https://doi -0051-x.
.org/10.1061/(ASCE)0887-3801(1994)8:2(129). Pruss, A. 1998. “Review of epidemiological studies on health effects from
Ghaderi, K., B. Motamedvaziri, M. Vafakhah, and A. A. Dehghani. 2019. exposure to recreational water.” Int. J. Epidemiol. 27 (4): 1–9. https://
doi.org/10.1093/ije/27.1.1.
“Regional flood frequency modeling: A comparative study among sev-
eral data-driven models.” Arab. J. Geosci. 12 (18): 588. https://doi.org Russell, S. J., and P. Norvig. 2010. Artificial intelligence—A modern
/10.1007/s12517-019-4756-7. approach. 3rd ed. London: Pearson.
Schippmann, B., G. Schernewhki, and U. Grawe. 2013. “Escherichia coli
Gonzalez, R. A., K. E. Conn, J. R. Crosswell, and R. T. Noble. 2012. “Ap-
pollution in a Baltic Sea lagoon: A model-based source and spatial risk
plication of empirical predictive modeling using conventional and alter-
assessment.” Int. J. Hyg. Environ. Health 216 (4): 408–420. https://doi
native fecal indicator bacteria in eastern North Carolina waters.” Water
.org/10.1016/j.ijheh.2012.12.012.
Res. 46 (18): 5871–5882. https://doi.org/10.1016/j.watres.2012.07.050.
Stefansson, A., N. Koncar, and A. J. Jones. 1997. “A note on the gamma
Gonzalez, R. A., and R. T. Noble. 2014. “Comparisons of statistical models
test.” Neural Comput. Appl. 5 (14): 131–133. https://doi.org/10.1007
to predict fecal indicator bacteria concentrations enumerated by QPCR-
/BF01413858.
and culture-based methods.” Water Res. 48 (Jan): 296–305. https://doi
Stone, M. 1974. “Cross-validatory choice and assessment of statistical pre-
.org/10.1016/j.watres.2013.09.038.
dictions.” J. R. Stat. Soc. Ser. B 36 (4): 117–147. https://doi.org/10.1111
He, L. M., and Z. L. He. 2008. “Water quality prediction of marine recrea-
/j.2517-6161.1974.tb00994.x.
tional beaches receiving watershed baseflow and stormwater runoff in
Thoe, W., M. Gold, A. Griesbach, M. Grimmer, M. L. Taggart, and
southern California, USA.” Water Res. 42 (10–11): 2563–2573. https://
A. B. Boehm. 2015. “Sunny with a chance of gastroenteritis: Predicting
doi.org/10.1016/j.watres.2008.01.002.
swimmer risk at California beaches.” Environ. Sci. Technol. 49 (4):
Huang, G., R. A. Falconer, and B. Lin. 2017. “Integrated hydro-bacterial 423–431. https://doi.org/10.1021/es504701j.
modelling for predicting bathing water quality.” Estuarine Coastal Shelf Thoe, W., and J. H. W. Lee. 2014. “Daily forecasting of Hong Kong
Sci. 188 (Jan): 145–155. https://doi.org/10.1016/j.ecss.2017.01.018. beach water quality by multiple linear regression models.” J. Environ.
Iyer, M. S., and R. R. Rhinehart. 1999. “A method to determine the required Eng. 140 (2): 04013007. https://doi.org/10.1061/(ASCE)EE.1943-7870
number of neural-network training repetitions.” IEEE Trans. Neural .0000800.
Networks 10 (2): 427–432. https://doi.org/10.1109/72.750573. Uncles, R. J. 1981. “A numerical simulation of the vertical and horizontal
Jin, G., and A. J. Englande. 2006. “Prediction of swimmability in a brackish M2 tide in the Bristol Channel and comparisons with observed data.”
water body.” Manage. Environ. Qual. 17 (2): 197–208. https://doi.org Limnol. Oceanogr. 26 (3): 571–577. https://doi.org/10.4319/lo.1981.26
/10.1108/14777830610650500. .3.0571.
Jones, A. J. 2004. “New tools in non-linear modelling and prediction.” CMS USEPA. 2010. Predictive tools for beach notification, vol I: Review and
1 (Jun): 109–149. https://doi.org/10.1007/s10287-003-0006-1. technical protocol. Rep. No. EPA-823-R-10-003. Washington, DC:
Kashefipour, S. M., B. Lin, and R. A. Falconer. 2005. “Neural networks for USEPA.
predicting seawater bacterial levels.” Water Manage. 158 (3): 111–118. Whitman, R. L., M. B. Nevers, G. C. Korinek, and M. N. Byappanahalli.
https://doi.org/10.1680/wama.2005.158.3.111. 2004. “Solar and temporal effects on Escherichia coli concentration at a
Keiner, L. E., and X. Yan. 1998. “A neural network model for estimating Lake Michigan swimming beach.” Appl. Environ. Microbiol. 70 (7):
sea surface chlorophyll and sediments from thematic mapper imagery.” 4276–4285. https://doi.org/10.1128/AEM.70.7.4276-4285.2004.
Remote Sens. Environ. 66 (98): 153–165. https://doi.org/10.1016/S0034 Wyer, M. D., et al. 2013a. Faecal indicator source connectivity for inputs to
-4257(98)00054-6. Swansea Bay, south Wales, UK. Technical Rep. for the Interreg 4a
Kim, J. H., S. B. Grant, C. D. McGee, B. F. Sanders, and J. L. Largier. 2004. Smart Coasts—Sustainable Communities Project. Aberystwyth, Wales:
“Locating sources of surf zone pollution: A mass budget analysis of Aberystwyth Univ.
fecal indicator bacteria at Huntington Beach, California.” Environ. Sci. Wyer, M. D., et al. 2013b. Statistical modelling of faecal indicator organisms
Technol. 38 (9): 2626–2636. https://doi.org/10.1021/es034831r. at a marine bathing water site: Results of an intensive study at Swansea
King, J., R. Ahmadian, and R. Falconer. 2021. “Hydro-epidemiological Bay, UK. Technical Rep. for the Interreg 4a Smart Coasts—Sustainable
modelling of bacterial transport and decay in nearshore coastal waters.” Communities Project. Aberystwyth, Wales: Aberystwyth Univ.
Water Res. 196 (Nov): 117049. https://doi.org/10.1016/j.watres.2021 Wyer, M. D., D. Kay, H. Morgan, S. Naylor, S. Clark, J. Watkins, C. M.
.117049. Davies, C. Francis, H. Osborn, and S. Bennett. 2018. “Within-day