
Machine Learning in Environmental Science and Engineering

This document discusses applying machine learning techniques to environmental science and engineering. It begins by defining machine learning and the different types, including supervised learning, unsupervised learning, classification algorithms, and regression algorithms. It then provides examples of how machine learning can be used, such as environmental niche modeling to predict species distributions, predicting pollution levels, and predicting deforestation. The document emphasizes that machine learning has the power to help deal with complex environmental issues more efficiently.

Uploaded by

shreyash gupta

APPLICATION OF MACHINE LEARNING TO

ENVIRONMENTAL SCIENCE AND ENGINEERING


SHREYASH GUPTA

1RV18CS161

Abstract-In this paper we discuss various ways in which machine learning and artificial intelligence can be applied in the fields of environmental science and engineering, such as predicting pollution, predicting the risks of chemical exposure to humans and animals without threatening anyone's life, predicting deforestation, and modelling the spread of invasive species. Anything related to complex decision making that has quantifiable variables can be addressed with the help of operations research tools and machine learning. The future scope of such techniques is also discussed in this paper.

What is Machine Learning?

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. As is evident from the name, it gives the computer that which makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect. ML is broadly classified into two main categories: supervised learning and unsupervised learning.

In supervised learning, the algorithm builds a mathematical model from a set of data that contains both the inputs and the desired outputs. For example, if the task were determining whether an image contained a certain object, the training data for a supervised learning algorithm would include images with and without that object (the input), and each image would have a label (the output) designating whether it contained the object. In special cases, the input may be only partially available, or restricted to special feedback. Semi-supervised learning algorithms develop mathematical models from incomplete training data, where a portion of the sample input doesn't have labels.

Classification algorithms and regression algorithms are types of supervised learning. Classification algorithms are used when the outputs are restricted to a limited set of values. For a classification algorithm that filters emails, the input would be an incoming email, and the output would be the name of the folder in which to file the email. For an algorithm that identifies spam emails, the output would be the prediction of either "spam" or "not spam", represented by the Boolean values true and false. Regression algorithms are named for their continuous outputs, meaning they may have any value within a range. Examples of a continuous value are the temperature, length, or price of an object.

In unsupervised learning, the algorithm builds a mathematical model from a set of data which contains only inputs and no desired output labels. Unsupervised learning algorithms are used to find structure in the data, like grouping or clustering of data points. Unsupervised learning can discover patterns in the data, and can group the inputs into categories, as in feature learning. Dimensionality reduction is the process of reducing the number of "features", or inputs, in a set of data.

In a nutshell, we develop a hypothesis that works correctly in predicting the training data, then we optimize it such that it also works correctly on unseen (new) data.

Deep learning comes under machine learning. It uses neural networks, which are inspired by the networks of neurons in our brain. A neural network consists of neurons that fire based on the outputs of previous neurons.

[Figure] The image above sums up how ML works.

Why machine learning?

Machine learning techniques already outperform human volunteers in several conservation activities, speeding up environmental-protection efforts and maximizing the resources available. For example, automated animal identification performs at the same 96.6% accuracy level as human volunteers, saving approximately 8.2 years of human labeling effort on a 3.2-million-image data set.
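The workflow described under "What is Machine Learning?" (fit a hypothesis on labelled training data, then check that it also works on unseen data) can be illustrated with a minimal sketch. This example is not from the paper: the data are synthetic, and `make_data`, `fit_threshold` and `accuracy` are hypothetical helper names. It learns a one-feature classifier with a Boolean output, in the spirit of the "spam" / "not spam" example, and reports accuracy on held-out data.

```python
import random

def make_data(n, seed=0):
    """Synthetic, made-up data: one feature x in [0, 10]; the label is True when x > 5."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        data.append((x, x > 5.0))
    return data

def fit_threshold(train):
    """The 'hypothesis' is a single threshold t: predict True when x > t.
    Try every training value as t and keep the one with the fewest training errors."""
    candidates = sorted(x for x, _ in train)
    best_t, best_err = candidates[0], len(train) + 1
    for t in candidates:
        err = sum((x > t) != y for x, y in train)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def accuracy(threshold, data):
    """Fraction of examples whose predicted label matches the true label."""
    return sum((x > threshold) == y for x, y in data) / len(data)

data = make_data(200)
train, test = data[:150], data[150:]   # the test examples are never used for fitting

threshold = fit_threshold(train)
train_acc = accuracy(threshold, train)
test_acc = accuracy(threshold, test)
print(f"threshold={threshold:.2f} train accuracy={train_acc:.2f} test accuracy={test_acc:.2f}")
```

The training accuracy is perfect here only because the synthetic labels are exactly separable by a threshold. The held-out accuracy is the figure that indicates how well the hypothesis generalizes, and it can be slightly lower when a test point falls between the learned threshold and the true boundary at 5.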
Due to their powerful nonlinear modeling capability, machine learning methods today are used in satellite data processing, general circulation models (GCM), weather and climate prediction, air quality forecasting, analysis and modeling of environmental data, oceanographic and hydrological forecasting, ecological modeling, and monitoring of snow, ice and forests.

In this world, where we no longer program machines, but teach them, this means more than just filling in Excel tables and writing reports. In fact, environmental scientists can leverage artificial intelligence to deal better and faster with environmental issues, like global warming. Cognitive computing has the power to completely revolutionize their workflow, making it more efficient in terms of time, resources, and accuracy.

There are many instances where a simple program cannot predict what is going to happen in the future. Here machine learning and artificial intelligence come into play. ML algorithms can predict future outcomes to a considerable extent. Algorithms such as anomaly detection can flag anomalies even in cases that have never occurred before. Now that we have an idea of how ML works and why it is useful, we can discuss various applications where machine learning can be used.

1.) Environmental (or ecological) niche modelling (ENM)

Species distribution modelling (SDM), also known as environmental (or ecological) niche modelling (ENM), habitat modelling, predictive habitat distribution modelling, and range mapping, uses computer algorithms to predict the distribution of a species across geographic space and time using environmental data. The environmental data are most often climate data (e.g. temperature, precipitation), but can include other variables such as soil type, water depth, and land cover. SDMs are used in several research areas in conservation biology, ecology and evolution. These models can be used to understand how environmental conditions influence the occurrence or abundance of a species, and for predictive purposes (ecological forecasting). Predictions from an SDM may be of a species' future distribution under climate change, a species' past distribution in order to assess evolutionary relationships, or the potential future distribution of an invasive species. Predictions of current and/or future habitat suitability can be useful for management applications (e.g. reintroduction or translocation of vulnerable species, reserve placement in anticipation of climate change).

There are a variety of mathematical methods that can be used for fitting, selecting, and evaluating correlative SDMs. An incomplete list of algorithms that have been used for niche modelling includes:

• Artificial Neural Networks (ANNs)
• Boosted Regression Trees
• Random forest (RF)
• Support vector machines (SVM)

There are two main types of SDMs. Correlative SDMs, also known as climate envelope models, bioclimatic models, or resource selection function models, model the observed distribution of a species as a function of environmental conditions. Mechanistic SDMs, also known as process-based models or biophysical models, use independently derived information about a species' physiology to develop a model of the environmental conditions under which the species can exist.

The extent to which such modelled data reflect real-world species distributions will depend on a number of factors, including the nature, complexity, and accuracy of the models used and the quality of the available environmental data layers; the availability of sufficient and reliable species distribution data as model input; and the influence of various factors such as barriers to dispersal, geologic history, or biotic interactions, that increase the difference between the realized niche and the fundamental niche. Environmental niche modelling may be considered a part of the discipline of biodiversity informatics.

2.) Species identification

Identifying taxa can require specialized knowledge possessed by only a very few, and the data set requiring expert curation can be large (e.g., automated collection of images and sounds). Thus, the expert annotation step is a major bottleneck in biodiversity studies. In order to increase throughput, algorithms are trained on images, sounds, and other types of data labeled with taxon names. (For more information about automated taxon identification specifically, see Edwards et al. 1987 and MacLeod 2007). The trained algorithms can then automatically annotate new data. This technique has been used to identify plankton, spiders, and shellfish larvae from images (Boddy and Morris 1999, Do et al. 1999, Sosik and Olson 2007, Goodwin et al. 2014). Bacterial taxa have been identified from gene sequences (Wang et al. 2007). Audio files of amphibian, bird, bat, insect, elephant, cetacean, and deer sounds have been classified to species (Parsons and Jones 2000, Jennings et al. 2008, Chesmore 2004, Acevedo et al. 2009, Armitage and Ober 2010, Kasten et al. 2010). Fish and algal species have been identified using acoustic (Simmonds et al. 1996) and optical characteristics (Balfoort et al. 1992, Boddy et al. 1994). ML has been used to differentiate between the radar signals of birds and abiotic objects (Rosa et al. 2015). In some cases, individuals of the same species can be distinguished even if the individuals themselves are unknown a priori (Reby et al. 1998, Fielding 1999b). Common tools include support vector machines (Fagerlund 2007, Sosik and Olson 2007, Acevedo et al. 2009, Armitage and Ober 2010, Goodwin et al. 2014, Rosa et al. 2015), Random Forest (Armitage and Ober 2010, Rosa et al. 2015), Bayesian classifiers (Fielding 1999a, Wang et al. 2007), genetic algorithms (Jeffers 1999), and neural networks (Balfoort et al. 1992, Boddy et al. 1994, Simmonds et al. 1996, Do et al. 1999, Parsons and Jones 2000, Jennings et al. 2008, Armitage and Ober 2010, Rosa et al. 2015).

3.) Locating potentially polluting animal farms.

How to locate potentially polluting animal farms has long been a problem for environmental regulators. Now, Stanford scholars show how a map-reading algorithm could help regulators identify facilities more efficiently than ever before.

Law Professor Daniel Ho, along with doctoral candidate Cassandra Handan-Nader, has figured out a way for machine learning – teaching a computer how to identify and analyze patterns in data – to efficiently locate industrial animal operations and help regulators determine each facility's environmental risk. The researchers' findings are set to publish April 8 in Nature Sustainability.
According to the Environmental Protection Agency (EPA), agriculture is the leading contributor of pollutants into the nation's water supply, with substantial pollution believed to be emanating from large-scale, concentrated animal feeding operations, known also as CAFOs.

But environmental monitoring efforts have been stymied by a basic problem: Regulators have no systematic way of determining where CAFOs are located, Ho said. The United States Government Accountability Office reports that no federal agency has reliable information on the number, size and location of large-scale agricultural operations.

While the Clean Water Act does require some federal permitting, it only applies to operations that actually discharge pollutants into U.S. waterways – not facilities that could potentially cause contamination – intentionally or not, Ho said.

With no definite list to turn to, efforts to monitor potentially polluting facilities are difficult and, in some cases, impossible.

To solve this problem, big data and deep learning were used. Deep learning algorithms have revolutionized the ability to detect complex objects in imagery. With the help of several open source tools and a team of students in economics and computer science to assist with data analysis, Ho and Handan-Nader were able to retrain an existing image-recognition model to recognize large-scale animal facilities by using information collected by two nonprofit groups and publicly available satellite images from the USDA's National Agriculture Imagery Program (NAIP). The researchers focused on trying to identify poultry facilities in North Carolina because most are not required to obtain permits, Ho said.

The model, already savvy in scanning images based on an enormous corpus of digital images, was retrained to pick up on similar clues that the environmental organizations had been manually monitoring. For example, swine farms were identifiable by compact rectangular barns abutted by large liquid manure pits, and poultry by long rectangular barns and dry manure storage. By homing in on these prominent features, the model was also able to provide size estimates for the facilities.
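One common way to "retrain an existing image-recognition model" is to keep the already-trained layers fixed and fit only a new final classifier on the newly labelled examples. The text does not say whether the researchers did exactly this, so the following is only a toy sketch of that idea under loud assumptions: `frozen_features` is a hand-written, hypothetical stand-in for a pretrained network, the "image tiles" are tiny made-up grids rather than NAIP imagery, and the new final layer is trained with simple perceptron updates.

```python
import random

GRID_H, GRID_W = 8, 16  # size of the made-up "image tiles" (assumption)

def make_tile(width, height, rng):
    """Draw one bright rectangle (a 'barn') at a random position on a dark tile."""
    top = rng.randrange(GRID_H - height + 1)
    left = rng.randrange(GRID_W - width + 1)
    tile = [[0] * GRID_W for _ in range(GRID_H)]
    for r in range(top, top + height):
        for c in range(left, left + width):
            tile[r][c] = 1
    return tile

def frozen_features(tile):
    """Hypothetical stand-in for a pretrained network; it is never updated.
    Returns [width/height ratio of the bright region, 1.0 as a bias input]."""
    rows = [r for r in range(GRID_H) if any(tile[r])]
    cols = [c for c in range(GRID_W) if any(tile[r][c] for r in range(GRID_H))]
    width = cols[-1] - cols[0] + 1
    height = rows[-1] - rows[0] + 1
    return [width / height, 1.0]

def score(weights, x):
    return sum(w * v for w, v in zip(weights, x))

def train_head(features, labels, max_epochs=5000):
    """Fit only the new final layer (perceptron updates on the fixed features)."""
    weights = [0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(features, labels):
            pred = 1 if score(weights, x) > 0 else 0
            if pred != y:
                mistakes += 1
                for i in range(len(weights)):
                    weights[i] += (y - pred) * x[i]
        if mistakes == 0:  # every training tile is classified correctly
            break
    return weights

def predict(weights, tile):
    return 1 if score(weights, frozen_features(tile)) > 0 else 0

# Label 1 = long barn (poultry-like), label 0 = compact barn (swine-like).
LONG = [(w, h) for w in range(8, 13) for h in (1, 2)]
COMPACT = [(w, h) for w in range(3, 6) for h in (3, 4)]

rng = random.Random(0)
train = [(make_tile(w, h, rng), 1) for w, h in LONG] + \
        [(make_tile(w, h, rng), 0) for w, h in COMPACT]
test = [(make_tile(w, h, rng), 1) for w, h in LONG] + \
       [(make_tile(w, h, rng), 0) for w, h in COMPACT]

head = train_head([frozen_features(t) for t, _ in train], [y for _, y in train])
test_accuracy = sum(predict(head, t) == y for t, y in test) / len(test)
print(f"head weights={head} test accuracy={test_accuracy:.2f}")
```

The test tiles reuse the same barn shapes at new random positions, so this only checks that a classifier trained on fixed features carries over to shifted examples. A real pipeline would replace `frozen_features` with the feature layers of an actual pretrained network and evaluate on facilities the model has never seen.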
The researchers found that their algorithm was able to identify 15 percent more poultry farms than what was originally found through manual endeavors. And because their approach could scale across years of NAIP imagery, their algorithm was able to accurately estimate growth by identifying changes over time, in this case after the construction of a new building, a feed mill.

“The model detected 93 percent of all poultry CAFOs in the area, and was 97 percent accurate in determining which ones appeared after the feed mill opened,” Handan-Nader and Ho write in the paper.

4.) Machine learning for water monitoring, hydrology and sustainability

Fresh water is a limited resource. Industries directly tied to water include agriculture, mining, forestry, hydropower, waste management etc.

Machine learning can automate, simplify and improve many aspects of water monitoring including:
1) Improving modeling and analysis
2) Detecting and correcting equipment malfunctions
3) Detecting environmental anomalies
4) Predicting the effects of policy decisions
5) Automating and controlling allocation and distribution

AI has helped environmental researchers reach almost 90 per cent accuracy in spotting climate change factors like tropical cyclones, weather fronts, tidal changes and atmospheric rivers, which can cause heavy precipitation and are often impossible for humans to identify on their own.

Water quality parameters such as dissolved oxygen and turbidity play a key role in policy decisions regarding the maintenance and use of the nation's major bodies of water. In particular, the United States Geological Survey (USGS) maintains a massive suite of sensors throughout the nation's waterways that are used to inform such decisions, with all data made available to the public. However, the corresponding measurements are regularly corrupted due to sensor faults, fouling, and decalibration, and hence USGS scientists are forced to spend costly time and resources manually examining data to look for anomalies. Joslyn and Lipor present a method that automatically detects such events using supervised machine learning. They first present an extensive study of which water quality parameters can be reliably predicted, using support vector machines and gradient boosting algorithms for regression. They then show that the trained predictors can be used to automatically detect sensor decalibration, providing a system that could be easily deployed by the USGS to reduce the resources needed to maintain data fidelity.

5.) Rainfall-runoff model using an artificial neural network approach
A runoff model is a mathematical model describing the rainfall–runoff relations of a rainfall catchment area, drainage basin or watershed. More precisely, it produces a surface runoff hydrograph in response to a rainfall event, represented by and input as a hyetograph.

The ANN model provides a more systematic approach, reduces the length of calibration data, and shortens the time spent in calibration of the models. At the same time, it represents an improvement upon the prediction accuracy and flexibility of current methods.

In the present study, the flow and rainfall series observed in the Ourika basin at the Aghbalou station in Morocco are analyzed using the ANN model. The Ourika basin is the most important subcatchment of the Tensift basin, located in the semi-arid region of Marrakech, and drains an area of about 503 km².

The daily rainfall and runoff data at the Aghbalou station were used for model investigation. The data contain information for a period of seven years (1990 to 1996). The entire database is represented by 2550 daily values of rainfall and runoff pairs. The ANN model was trained using the resulting runoff and rainfall daily data. The database was collected by the Rabat hydraulic administration. The input vector is represented by rainfall and runoff values for the preceding seven days (i.e., t - 1, t - 2, t - 3, t - 4, t - 5, t - 6, t - 7) as well as the rainfall value expected for day t. Accordingly, the output vector represents the expected runoff value for day t.

The database compiled represents seven years of daily sets of rainfall-runoff values for the Ourika Wadi basin. In this paper, data for the last year (1996) were used for model testing, while the remaining data (1990 to 1995) were used for model training/calibration. The training phase of the ANN model was terminated when the average squared error (ASE) on the testing database was minimal. The goal of the training process is to reach an optimal solution based on some performance measurements such as ASE, the coefficient of determination known as the R-square value (R²), and the MARE (Mean Absolute Relative Error).

The comparison between the predicted and actual flow values at the training and testing phases shows excellent agreement, with R² values of 0.948 and 0.917 respectively (see Figure 3). Note that data pairs closer to the 45° line represent better prediction cases. The good performance and convergence of the model are illustrated in the figure above.

The artificial neural network (ANN) models show good capability to model hydrological processes. They are useful and powerful tools to handle complex problems compared with other traditional models. In this study, the results obtained show clearly that artificial neural networks are capable of modelling the rainfall-runoff relationship in arid and semi-arid regions in which the rainfall and runoff are very irregular, thus confirming the general enhancement achieved by using neural networks in many other hydrological fields. The results and comparative study indicate that the artificial neural network method is more suitable for predicting river runoff than a classical regression model. The ANN approach could provide a very useful and accurate tool to solve problems in water resources studies and management.

6.) Water Demand Prediction

Water is an important factor for economic and social development, where it plays a vital role in maintaining health, generating energy, growing agricultural products, creating opportunities and managing the environment. The availability and management of water affect every individual, from growing industries to poor villages. Water demand is a nonlinear and non-stationary function influenced by various factors such as geographical and climatic conditions and water utilization in different fields for various purposes [1]. Therefore, there is a challenge to build a dynamic, robust and adaptive forecasting model that can accurately predict on-demand water usage.

Unlike weather forecasting, water demand prediction relies on quantitative data measured by the consumption units and meters of the water distribution network in each region around the world. Recent advances in the field of machine learning (ML) have enabled significant computational analysis of Water Distribution Networks (WDN).

The data model flow of the water demand prediction is illustrated in Fig. 1. In the given industry, consumption units record the water used for processing, which has been used as the data set. The data contain water consumed in liters per day. After the acquisition of datasets for the time series model, any missing data is pre-processed if required, then modelled for the machine learning algorithm.

The analysis of this study is based on the water usage data that was recorded in the dairy industry. This work uses 90 days of daily consumption to find the actual requirement of water demand with respect to the average demand in the dairy industry.

Using the supervised learning technique of support vector machine regression (SVR), the water consumption from different units is used as the input dataset. The SVR model is chosen because support vector machines can perform both regression and classification. The objective of using this method is to use the time series data to train the model for prediction. The SVR model is very robust and outperformed other statistical models, which required a greater number of data sets while building the training model. As shown in Fig. 3, SVR works effectively on small samples of training and testing sets. For the processing of the datasets, the Radial Basis Function (RBF) kernel is used.

In this experiment, the data collected is used to determine the 10-day estimation of the test dataset. At first, 80 days of observations were used for training the model and the remaining days were chosen as the testing dataset. In order to validate the predicted data, the training data were first analyzed and were recorded for actual and predicted consumption.

It is observed that the SVR model's predictions were close to the real values. For building the model, time difference data is applied. To make it more categorical, the data set was categorized into weekdays and weekends. For this case each hour of the day is a categorical variable, grouped by shift. In the analysis, consumption of less than 400 liters per hour is considered the minimum and above 900 liters per hour the maximum. Since this is a time series, the requirement is to predict the next hour's consumption as well as the total consumption in a daily and monthly scenario. Usually, a work shift in a dairy farm has to run 24 hours, depending on the industry's schedule.

7.) Predictive Models for Air Quality Monitoring and Characterization
One of the biggest environmental problems right now is air pollution. Monitoring air quality is one of the best ways to prevent the harmful effects of air pollution. Having information about the quality of air can lead to formulating suggestions and data-driven recommendations to mitigate the possible harmful effects it can bring. Air quality needs to be consistently monitored and assessed to ensure better living conditions. The U.S. Environmental Protection Agency (EPA) uses the Air Quality Index (AQI) to standardize the air quality. However, AQI requires precise and accurate sensor readings and complex calculations, making it not feasible for portable air quality monitoring devices.

The aim was to find an alternative way of monitoring and characterizing air quality through the use of integrated gas sensors and building predictive models using machine learning algorithms that can be used to obtain data-driven solutions to mitigate the risk of air pollution. The proposed methodology is implemented by building a prototype for the integrated sensors using a DHT11 temperature and relative humidity sensor and MQ2, MQ5 and MQ135 gas sensors.

The sensor readings serve as predictors and the air quality is the class variable. A total of 750 observations are recorded for the purpose of this paper. After the data gathering, the dataset undergoes data cleaning and tidying. Tidy data is a concept used in data science wherein a data frame is formatted so that each variable forms a column and each observation forms a row.

The predictive models that are used here employ supervised machine learning algorithms and, as a result, the levels of health concern are assigned by the researchers during the data gathering phase. Careful supervised classification is done to ensure the validity of the data gathered.

After building and validating each predictive model using the different machine learning algorithms, the models are compared with one another and the best predictive model is selected in terms of accuracy and logarithmic loss (log loss) performance.

Five predictive models were developed in the study: k-nearest neighbors (KNN), support vector machine (SVM), Naïve Bayes classifier, random forest and neural network. Results show that the researchers were able to obtain accuracies of 98.67%, 97.78%, 98.67%, 94.22%, and 99.56% for the five models respectively, with the neural network being the best performing model.

Based on the data and results, the proposed methodology of characterization of the air quality index using machine learning-based predictive models is implemented successfully. A prototype composed of an array of sensors was developed. Five machine learning models were established, with the neural network being the best, with an accuracy of 99.56% and a log loss of 0.0543.

References:

1) Timothy M. Amado; Jennifer C. Dela -, “Development of Machine Learning-based Predictive Models for Air Quality Monitoring and Characterization”
2) Kathleen Joslyn; John Lipor, “A Supervised Learning Approach to Water Quality Parameter Prediction and Fault Detection”
3) Wikipedia
4) Kathleen Joslyn; John Lipor, “A Supervised Learning Approach to Water Quality Parameter Prediction and Fault Detection”
5) Amrita Tamang; Samiksha Shukla, “Water Demand Prediction Using Support Vector Machine Regression”
6) D. Basak, S. Pal, and D. C. Patranabis, “Support vector regression,” Neural Information Processing-Letters and Reviews, vol. 11, no. 10, pp. 203–224, 2007.
7) T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 785–794.
8) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
