Multivariate Time Series Classification of Sensor Data from an Industrial Drying Hopper: A Deep Learning Approach
2021
Recommended Citation
Rahman, Md Mushfiqur, "Multivariate Time Series Classification of Sensor Data from an Industrial Drying
Hopper: A Deep Learning Approach" (2021). Graduate Theses, Dissertations, and Problem Reports. 8309.
https://researchrepository.wvu.edu/etd/8309
This Thesis is protected by copyright and/or related rights. It has been brought to you by The Research
Repository @ WVU with permission from the rights-holder(s). You are free to use this Thesis in any way that is
permitted by the copyright and related rights legislation that applies to your use. For other uses you must obtain
permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license
in the record and/or on the work itself. This Thesis has been accepted for inclusion in the WVU Graduate Theses,
Dissertations, and Problem Reports collection by an authorized administrator of The Research Repository @ WVU.
For more information, please contact [email protected].
Md Mushfiqur Rahman
Thesis submitted to the Benjamin M. Statler College of Engineering and Mineral Resources at
West Virginia University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Industrial Engineering
I am also grateful to the committee members Dr. Kenneth Currie and Dr. Behrooz Kamali for
their valuable insights and feedback over the course of this thesis.
I wish to thank my parents, without whose prayers and constant support I could never have reached
this stage of my life.
Table of Contents
List of Figures ........................................................................................................................... vi
List of Tables ......................................................................................................................... viii
1 Introduction ........................................................................................................................ 1
1.1 General Introduction ................................................................................................... 1
1.2 Background ................................................................................................................. 3
1.3 Objectives and Scopes ................................................................................................. 4
1.4 Outline of methodology .............................................................................................. 4
1.5 Organization of the Thesis .......................................................................................... 5
2 Literature Review............................................................................................................... 6
2.1 Traditional Algorithms ................................................................................................ 6
2.2 Deep Learning Approaches ......................................................................................... 8
2.3 Data Labeling ............................................................................................................ 11
3 Methodology .................................................................................................................... 12
3.1 Data Exploration and Preprocessing ......................................................................... 12
3.1.1 Handling Missing Values ................................................................................... 14
3.1.2 Data labeling ...................................................................................................... 15
3.1.3 Characteristics of the labelled dataset ................................................................ 23
3.2 Solution Approach..................................................................................................... 31
3.2.1 Artificial Neural Network .................................................................................. 31
3.2.2 Convolutional Neural network (CNN)............................................................... 35
3.2.3 Recurrent Neural Network ................................................................................. 38
3.2.4 Combination of CNN and LSTM ...................................................................... 42
3.2.5 Machine Learning Algorithms ........................................................................... 43
3.2.6 K-Nearest neighbor (KNN)................................................................................ 46
3.2.7 Performance Measure ........................................................................................ 47
3.2.8 System specification .......................................................................................... 48
4 Experimental Results and Discussion .............................................................................. 50
4.1 Experimental setup .................................................................................................... 50
4.2 Hyperparameter Tuning ............................................................................................ 52
4.3 Result......................................................................................................................... 54
4.3.1 Ensemble Learning (CNN) ................................................................................ 54
4.3.2 Ensemble Method (LSTM) ................................................................................ 56
4.3.3 Ensemble Learning (CNN-LSTM) .................................................................... 58
4.3.4 SMOTE .............................................................................................................. 60
List of Figures
Figure 1.1: Temperature Profiles ............................................................................................... 3
Figure 2.1: Two methods of calculating DTW distance[37] ..................................................... 6
Figure 2.2: Deep learning overview for time series classification[20] ...................................... 9
Figure 2.3: MDDNN model architecture[61] .......................................................................... 10
Figure 3.1: Raw dataset in CSV format ................................................................................... 12
Figure 3.2: Temperature profile obtained from primarily preprocessed data ............................ 13
Figure 3.3: Missing values in the primarily processed dataset ................................................ 13
Figure 3.4: Missing values in the temperature profile ............................................................. 14
Figure 3.5: Missing value imputation ...................................................................................... 14
Figure 3.6: Startup Procedure[36]............................................................................................ 15
Figure 3.7: Cleaning Cycle[36] ................................................................................................ 15
Figure 3.8: Conveying Issue[36].............................................................................................. 16
Figure 3.9: Event ...................................................................................................................... 17
Figure 3.10: Variation of an event ........................................................................................... 17
Figure 3.11: Definition of an event .......................................................................................... 19
Figure 3.12: Labelled Data ...................................................................................................... 21
Figure 3.13: Example of an event and a non-event ................................................................. 21
Figure 3.14: Dataset Statistics.................................................................................................. 23
Figure 3.15: A simple visualization of nullity by column ....................................................... 23
Figure 3.16: Nullity matrix ...................................................................................................... 24
Figure 3.17: Statistical summary of Hopper 1 hopper outlet temperature ............................... 24
Figure 3.18: Quantile and descriptive statistics of H1HOT ..................................................... 24
Figure 3.19: Common values and Extreme values of Minimum and Maximum of H1HOT .. 25
Figure 3.20: Binary classification labeling .............................................................................. 25
Figure 3.21: Change of distribution after data normalization .................................................. 27
Figure 3.22: Data normalization .............................................................................................. 28
Figure 3.23: Undersampling and oversampling[72] ................................................................ 29
Figure 3.24: SMOTE[76] ......................................................................................................... 30
Figure 3.25: Ensemble method[77].......................................................................................... 31
Figure 3.26: Basic Structure of a Neural Network [79] ........................................................... 32
Figure 3.27: Activation Function [80] ..................................................................................... 33
Figure 3.28: MLP with one hidden layer[82] .......................................................................... 35
Figure 3.29: Multi channel Deep CNN application on time series [32] .................................. 36
Figure 3.30: CNN for time series classification[83] ................................................................ 36
Figure 3.31: Different types of RNN architecture[84]............................................................. 38
Figure 3.32: Computational Graph of RNN [85] ..................................................................... 39
Figure 3.33: Back propagation through time[87] .................................................................... 39
Figure 3.34: Vanishing and Exploding Gradient [89] ............................................................. 40
Figure 3.35: LSTM structure[31], [89] .................................................................................... 41
Figure 3.36: CNN LSTM architecture[91] .............................................................................. 42
Figure 3.37: Support Vector Machine [94] .............................................................................. 44
Figure 3.38: Optimizing hyperplanes [95] ............................................................................... 44
Figure 3.39: Decision Tree[97] ................................................................................................ 45
Figure 3.40: Random Forest[98] .............................................................................................. 46
Figure 3.41: KNN [100] ........................................................................................................... 47
List of Tables
Table 3.1: Event durations ....................................................................................................... 19
Table 3.2: Sliding Window Algorithm .................................................................................... 22
Table 3.3: Histograms of the temperature zones ..................................................................... 26
Table 3.4: Python libraries ....................................................................................................... 49
Table 4.1: Training and Test Examples ................................................................................... 50
Table 4.2: List of hyperparameters .......................................................................................... 52
Table 4.3: Values considered for hyperparameters .................................................................. 52
Table 4.4: Initial values for hyperparameters .......................................................................... 52
Table 4.5: Result summary (average result in ten runs) .......................................................... 62
Table 4.6: Result summary (Best result in ten runs) ................................................................ 63
1 Introduction
1.1 General Introduction
In recent years, the advancement of smart manufacturing, the merger of information
technology and operational technology, has made the collection and processing of large amounts
of industrial process data attainable. These collections of data, referred to as big data, are closely
associated with artificial intelligence (AI), since big data is analyzed with the various AI methods
of machine learning and deep learning. AI refers to a computer, robot, or other machine
exhibiting human-like intelligence. By implementing AI, the computer or machine can mimic
the capabilities of the human mind by learning from examples and experience. AI is used to
recognize objects, understand and respond to commands, make decisions, and solve various
kinds of problems. Big data are being used to train different AI models. As a result, machines
can process large amounts of data faster than before, and we can ask machines to vacuum our
floors, finish our sentences while typing, and even recommend what to watch next on TV [1].
Large numbers of sensors installed in industrial equipment and machine tools on the shop floor
have accelerated this development and increased the amount of available data even more. These
sensors record the activity of a machine over time. Such data sets are referred to as time series
data, and their analysis has gained popularity over the last few years. The sensors installed in
industrial equipment and machinery assemble various time series information [2], which can be
analyzed to obtain meaningful events in smart manufacturing systems. In addition to
manufacturing [3], time series data can be found in various other domains such as healthcare [4],
climate [5], robotics [6], stock markets [7], energy systems [8], and many more. Among this
diverse set of domains, the manufacturing domain is the focus of this paper, as the case study is
set in the plastic processing industry.
Sensors are one of the key driving forces in the revolution of intelligent and smart
manufacturing. In an industrial machine tool, sensors may collect data for several key variables
over time; such data are called Multivariate Time Series (MTS) data. If only one variable is
measured over time, the data are called a univariate time series. In other words, an MTS consists
of several univariate time series, which is why MTS analysis is considered more complex than
the analysis of univariate time series. One of the main reasons behind this difference is the
correlation between the different variables. In this paper, our analysis of time series will be
limited to one of the most common machine learning problems: classification. In a nutshell, the
goal is to i) identify several key events over the time series and ii) enable the model to identify
the class of any key event from those classified and identified events within the dataset.
Classification problems mainly deal with categorical variables, where each observation belongs
to a specific category and the goal of the classification model is to identify the category of a
specific event. For MTS classification, the whole time series is divided into specific segments,
each of which belongs to a category with a distinguishing pattern.
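As a minimal illustration of this segmentation step (a hypothetical sketch, not the exact labeling procedure developed in Chapter 3), a sliding window can cut an MTS into fixed-length segments, each of which would then be assigned a class label:

```python
# Sketch: segmenting a multivariate time series with a sliding window.
# Each row of `series` is one time step; each column is one sensor variable.
# Window length and stride are illustrative choices, not the thesis settings.

def sliding_windows(series, window, stride):
    """Return a list of fixed-length segments (lists of rows)."""
    segments = []
    for start in range(0, len(series) - window + 1, stride):
        segments.append(series[start:start + window])
    return segments

# Toy MTS: 8 time steps, 2 variables (e.g., two temperature zones).
series = [[170, 95], [171, 96], [175, 97], [181, 99],
          [190, 104], [188, 103], [172, 96], [170, 95]]

segments = sliding_windows(series, window=4, stride=2)
# Yields 3 segments covering rows 0-3, 2-5, and 4-7; a classifier would
# assign each one a category such as "normal operation" or "disruption".
```

Each resulting segment is one training example for the classifiers discussed later.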
A number of algorithms have been developed to analyze MTS. Some common approaches used
before the evolution of smart manufacturing are simple exponential smoothing [9], dynamic
time warping [10], [11], and the autoregressive integrated moving average [12]. In addition to
those traditional approaches, several machine learning algorithms like K-nearest neighbor [13],
decision trees [14], and Support Vector Machines (SVM) [15] were used on multiple occasions.
Some authors combined the K-nearest-neighbor algorithm with distance measures such as DTW
[16], [17] or the Euclidean distance [18]. It has been shown that no single traditional approach
can outperform the result obtained from the K-nearest-neighbor algorithm coupled with a
suitable distance measure [19]. However, ensembles of discriminative classifiers such as SVM
and nearest neighbor with various distance measures, together with other machine learning
classifiers such as decision trees and random forests, can provide better results than nearest
neighbor combined with dynamic time warping (NN-DTW) [10].
Two of the common issues with the traditional methods are that they often fail to locate
important features within the time series on their own and cannot identify the correlation
between the variables, which results in false identification of categorical events [3]. From this
point of view, it is evident that for a univariate time series they may provide reasonably good
results, but for MTS their efficiency may not be good enough. In addition, handling the massive
volume of data is another issue for traditional approaches as well as for simple machine learning
algorithms. This is why deep learning has come into the picture, with the capability of handling
large amounts of data by using a deep neural network with multiple layers to extract meaningful
features.
For the last few years, deep learning techniques, a family of neural network algorithms, have
been used extensively to deal with time series problems. For MTS, deep learning approaches
are of special interest, as deep neural networks can learn the pattern of the dataset by
understanding the correlation between the variables of interest. It has been shown in the
literature that deep neural networks can significantly outperform traditional methods such as
NN-DTW [20] for both multivariate and univariate time series. For a small number of variables,
for example in signal processing where two nodes are available and the values obtained for those
two nodes over time build an MTS with two variables, NN-DTW may provide good results, as
the distance method needs to deal with only two curves. However, when the number of variables
increases, it becomes more complex for NN-DTW. This is why deep neural networks are of
special interest for the case study of this paper, where the dataset contains twelve distinct
temperature zones and thus twelve variables.
The two most common neural networks used over the last few years are the Convolutional
Neural Network (CNN) and the Recurrent Neural Network (RNN), and many variations of them
have been developed to tackle a variety of problems. CNNs gained much popularity for their
contribution to computer vision problems [21]. This is why CNNs have been used extensively
in image recognition tasks [22], natural language processing [23]–[25], and speech recognition
[26]. Speech recognition and natural language processing can both be seen as sequential
learning problems. This is why, although initially developed for computer vision problems,
CNN has been one of the most popular deep neural networks for dealing with time series
problems, especially MTS problems [3], [20], [27]–[33]. Another popular neural network used
recently is the RNN, which was mainly developed for sequential learning and performs well
for univariate time series, but its use for the classification of MTS is limited [34]. For time
series datasets with missing values, it provides reasonably good results [35].
This paper aims to answer the following research questions, which are relevant for the case
study presented in the next chapter:
RQ1: Can a deep learning approach provide better results than the traditional
approaches for this case study?
o RQ1.1: Which deep neural network is the most suitable one for this case?
o RQ1.2: Should we use a combination of several neural networks, like CNN
and RNN, or can a single one provide the best result?
RQ2: How can we deal with the unlabeled data issue?
1.2 Background
In the polymer processing industry, dryers are one of the fundamental components for
“supplying dry-heated air that is blown upward through the to-be-dried material for several
hours, while new undried, cold/moist material is continuously loaded on top of the dryer
module, steadily moving downward through the dryer” [36], [37]. The drying hopper has two
distinct components: the drying hopper monitor and the regen wheel. Both of these have a
distinct impact on the overall polymer processing. The drying hopper monitor has eight
temperature zones, the regen has three temperature zones, and the dew point temperature is also
measured for the delivery air; all of these temperatures are measured by temperature sensors.
For this case study, these twelve temperatures were recorded over a period of one year (12
months). The final available data are preprocessed, ignoring missing values and outliers or
extraneous cases. Overall, the data contain temperature readings for twelve temperature zones
collected over a year with a sampling interval of one minute. Figure 1.1 shows the temperature
profiles obtained from the sensors.
As a large amount of data is available for the six main temperature zones in the dryer/hopper
system and six additional temperature zones in the regen and dryer regions (such as the hopper 1
delivery air temperature), it is possible to extract meaningful features from the real-time
analysis of these data. If the real-time state of the drying hopper can be extracted from the
analysis of the data, the production planner can determine the type of maintenance necessary.
The main goal of identifying any key event is part of predictive maintenance, so that the
operator of the machine can identify any potential hazard in the process. The massive amount of
data collected from sensors has made this possible in recent years. These large data sets can be
analyzed to identify hundreds of features which can provide meaningful information about the
state and condition of the machines. Predictive maintenance is defined in [8] as “the
maintenance strategy that employs advanced analytics to predict machine failures”.
For this purpose, a deep learning algorithm needs to be employed so that, using a classifier, the
machine can automatically detect whether any specific instance belongs to a disruption event,
based on which proper initiatives can be taken for the smooth flow of production. A detailed
description of the overall process and some distinct events can be found in [36].
1.3 Objectives and Scopes
This thesis has several objectives and scopes. First, the versatility of MTS analysis needs to be
studied. Understanding the drying hopper mechanism and the patterns of its temperature
profiles is another important objective. Building on this, understanding the various parameters
related to MTS and applying that understanding to analyze the current material drying process
and the various events associated with an industrial drying hopper is a further objective. In
addition, identifying ways to deal with data labeling and the imbalanced data issue brings a lot
of potential scope for this specific drying hopper case. The primary goal of this thesis is to
understand the parameters related to machine learning and deep learning algorithms and to
employ those algorithms to classify the MTS data, so that a comparative analysis of the
employed algorithms can be performed.
1.4 Outline of methodology
In order to carry out the experiment, several steps have been incorporated, which are:
Study of the state of the art of MTS classification with traditional approaches,
machine learning algorithms, and deep learning algorithms
2 Literature Review
A variety of algorithms have been applied to MTS classification. MTS data have become more
common in various domains like anomaly detection, clinical diagnosis, weather prediction,
stock prices, human motion detection, fault detection in manufacturing processes, and so on.
Among this variety of fields, MTS data have become especially common in the manufacturing
industry due to the variety of sensors installed on the machinery on the shop floor of a
manufacturing plant. This is why MTS analysis such as classification has gained great
popularity among researchers in the manufacturing domain. With the increasing importance of
temporal data mining, researchers have continuously been developing a variety of algorithms to
tackle problems in this field. Among temporal data mining problems, multivariate analysis
presents high complexity, with an increasing number of variables which may or may not be
highly correlated. Overall, the spatial structure in temporal data, time dependency, correlation
among variables, etc., need to be carefully handled in any MTS analysis. In this section, the
current state of the art of MTS classification will be presented from two points of view: the
traditional approach and the AI approach, such as deep learning.
2.1 Traditional Algorithms
A benchmark algorithm for classifying MTS is K-nearest neighbor with dynamic time warping
(DTW). According to the authors in [37], two approaches can be taken for MTS data. One of
them sums up the univariate DTW distances over the dimensions of the MTS, whereas in the
other, the distance between two time steps is calculated by summing the distances over the
dimensions of the MTS within a single warping, as shown in Figure 2.1.
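A compact sketch of these two strategies (an illustrative implementation written for this discussion, not the exact code of [37]) might look like:

```python
# Sketch of "independent" vs. "dependent" DTW for a multivariate series.
# Independent: run DTW per variable and sum the per-variable distances.
# Dependent: one warping path, with the per-step cost summed over variables.

def dtw(a, b, cost):
    """Classic O(len(a)*len(b)) DTW with an arbitrary per-step cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def dtw_independent(x, y):
    # x, y: lists of time steps, each time step a list of variable values.
    n_vars = len(x[0])
    return sum(
        dtw([t[v] for t in x], [t[v] for t in y], lambda p, q: abs(p - q))
        for v in range(n_vars)
    )

def dtw_dependent(x, y):
    return dtw(x, y, lambda p, q: sum(abs(pi - qi) for pi, qi in zip(p, q)))

# Two toy 2-variable series of different lengths.
x = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0]]
y = [[1.0, 10.0], [3.0, 12.0]]
d_ind = dtw_independent(x, y)
d_dep = dtw_dependent(x, y)
```

A 1-NN classifier then simply assigns a query series the label of the training series with the smallest such distance.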
This technique is also used by the authors in [39]. DTW multivariate prototyping has been used
in evaluating scoring and assessment methods for virtual reality training simulators. It classifies
the VR data as novice, intermediate, or expert; 1-NN DTW performed reasonably well, and the
only better algorithm for that case was ResNet, an advanced version of CNN [40]. Overall,
using DTW as a dissimilarity measure among features of time series and adapting the nearest
neighbor classifier was very popular in temporal data mining before the evolution of deep
learning [41].
A parametric derivative DTW is another variant of DTW used in temporal data mining. This
technique combines two distances: the DTW distance between the MTS and the DTW distance
between the derivatives of the MTS. The new distance is then used for classification with
nearest neighbor rules [42]. Another approach taken for classifying MTS, in [43], uses a
template selection approach based on DTW so that complex feature selection and domain
knowledge can be avoided. Yet another variant uses the DTW distance measure with an
integral transformation: integral DTW is calculated as the value of DTW on the integrated time
series. This technique combines DTW and integral DTW with the 1-nearest-neighbor classifier
and shows no overfitting issue [44].
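The integral transformation itself is simple. A sketch, under the assumption that "integrated" means the running (cumulative) sum of the series, which is the usual reading:

```python
# Integral transform of a univariate series: the running (cumulative) sum.
# Integral DTW is then ordinary DTW applied to this transformed series;
# [44] combines the raw-DTW and integral-DTW distances in a 1-NN classifier.

def integrate(series):
    total, out = 0.0, []
    for x in series:
        total += x
        out.append(total)
    return out

transformed = integrate([1, 2, 3, 4])   # [1.0, 3.0, 6.0, 10.0]
```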
DTW has also been used with hesitant fuzzy sets, where time instance segments get more
attention than treating the MTS data as a whole object or time instance by time instance. In this
method, the alignment between time instance segments is optimized, as claimed by the authors
in [45]. Their research also showed that this method reduces to the original DTW for particular
settings of its scale parameters, and that it can balance the time consumption and the accuracy
of MTS classification.
Data normalization is a commonly used technique in any temporal data mining problem, as the
value ranges of different variables may differ widely. Sometimes, however, normalization can
destroy information present in the raw data, which is why a combination of both raw and
normalized data may preserve more meaningful information about the data. The authors in [46]
used this approach with nearest neighbor and DTW and obtained better classification accuracy.
The longest common subsequence method is sometimes incorporated with DTW to provide
better classification accuracy [47].
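For reference, per-variable z-normalization, the common form of normalization in time series mining, can be sketched as follows (a generic illustration, not the exact procedure of [46]):

```python
# Per-variable z-normalization of a multivariate time series.
# Each column (variable) is rescaled to zero mean and unit variance,
# so that variables measured on very different scales become comparable.

import math

def z_normalize(series):
    """series: list of time steps, each a list of variable values."""
    n_vars = len(series[0])
    out = [row[:] for row in series]
    for v in range(n_vars):
        col = [row[v] for row in series]
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        for row in out:
            row[v] = (row[v] - mean) / std if std > 0 else 0.0
    return out

# Toy example: a temperature-like variable and a dew-point-like variable.
series = [[170.0, 5.0], [180.0, 7.0], [190.0, 9.0]]
normalized = z_normalize(series)
# Each column of `normalized` now has mean 0; the approach in [46] would
# additionally keep the raw values alongside these normalized ones.
```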
Symbolic representation of MTS is another traditional technique for MTS classification, in
which all elements of the time series are considered simultaneously and symbols are learned
using a supervised learning algorithm. A tree-based ensemble is used to detect the interactions
between the univariate time series, represented as columns with a time index. A second
ensemble handles the high-dimensional input through implicit feature selection; these tree
learners can efficiently handle nominal and missing values [48]. MrSEQL is another technique
based on symbolic representation, which transforms time series data in the time domain, known
as symbolic aggregate approximation (SAX) [49], and in the frequency domain, known as
symbolic Fourier approximation (SFA) [50]. Discriminative subsequences are extracted from
the symbolic data and used as features for training a classification model [51], [52]. Word
extraction for time series classification plus multivariate unsupervised symbols and derivatives,
abbreviated WEASEL+MUSE, also uses the SFA transformation to create sequences of words.
A feature selection method determines promising features, which are extracted from all
dimensions; feature selection is performed using a chi-squared test, and logistic regression is
then used to learn from the features [53].
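To make the SAX idea concrete, a minimal sketch is given below (illustrative only; the published SAX algorithm [49] supports arbitrary alphabet sizes, with breakpoints taken from the Gaussian distribution):

```python
# Sketch of SAX (symbolic aggregate approximation) for one univariate series:
# 1) z-normalize, 2) piecewise aggregate approximation (PAA) to shorten the
# series, 3) map each PAA mean to a letter via Gaussian breakpoints.

import math

# Standard breakpoints for an alphabet of size 4 (quartiles of N(0, 1)).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax(series, n_segments):
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    z = [(x - mean) / std for x in series]
    seg_len = len(z) // n_segments      # assume divisible, for simplicity
    word = ""
    for s in range(n_segments):
        paa = sum(z[s * seg_len:(s + 1) * seg_len]) / seg_len
        idx = sum(1 for b in BREAKPOINTS if paa > b)
        word += ALPHABET[idx]
    return word

# A temperature spike becomes a short symbolic word whose letters encode
# "low, medium, high, medium" levels.
word = sax([1, 1, 2, 2, 8, 8, 2, 2], n_segments=4)
```

Classifiers such as MrSEQL then mine discriminative subwords from these symbolic strings instead of working on the raw values.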
In MTS classification, two major components need to be considered: approximating the
sequential dynamics and learning the relationships among the different variables. In [54], the
authors used a distance-based method to approximate the sequential dynamics, whereas
Granger causality is used to learn the relationships among the variables. Sparsity of the learnt
time series is constrained in order to find the focal series.
One of the most extensive studies of traditional methods for both MTS and univariate time
series (UTS) can be found in [10], which discusses almost all of the above-mentioned
traditional approaches in categories such as whole-series similarity, phase-dependent intervals,
phase-independent shapelets, dictionary-based classifiers, and combinations of transformations.
That paper is a great resource for any time series classification enthusiast seeking an overview
of the traditional methods. Another review paper with a brief overview of different
classification approaches for MTS is [52].
Machine learning algorithms, both nonlinear techniques and ensemble learning techniques,
have also been applied to time series classification over the years. Traditional classifiers like
Naïve Bayes, decision trees, and SVM are the most popular ones. Before using these
algorithms, the MTS data need to be converted into a feature vector format. For this reason, the
authors in [55] segmented the time series to obtain a qualitative description of each series and
determined the frequent patterns. Afterwards, the patterns that are highly discriminative
between the classes were selected, and the data were transformed into a vector format in which
the features are the discriminative patterns.
2.2 Deep Learning Approaches
With the evolution of deep learning, CNN has been the most used network in temporal data
mining, especially for classification. RNNs such as the Long Short-Term Memory (LSTM)
network have also been examples of recent algorithmic advances in time series classification.
Furthermore, combinations of these two algorithms, although they were developed for different
purposes, have shown extremely good results for time series classification. Over the years,
researchers have proposed many different versions of these algorithms, which perform well in
different case studies. In this section, several papers that have used these algorithms and their
variants are discussed briefly.
CNN has been adapted to time series classification by using 1D filters in the convolutional layer. The reason for its popularity is that it can automatically discover and extract a suitable internal structure to generate the deep features of the input time series through the convolution and pooling operations [56]. This is not the case in traditional feature extraction methods, where features need to be extracted manually through feature engineering.
“Deep learning for time series classification: A review” [20] and “The great MTS classification bake off: a review and experimental evaluation of recent algorithmic advances” [57] are two papers which summarize the basics of the recent algorithmic advances in the use of deep learning for MTS classification. Natural language processing (NLP) and speech recognition (SR) are two fields where RNN and LSTM have been highly successful over the years, and recently CNN has also shown high performance in terms of accuracy. NLP and SR both have sequential aspects similar to time series analysis. An overview of the deep learning approaches for time series classification, taken from [20], is shown in Figure 2.2.
One line of work encodes the time series as an image and classifies the image as production threatening or not. In [59], the authors performed a principal component analysis for feature extraction, reducing the number of MTS variables to two so that they could identify the two most useful components in the machine. The time series were encoded into images using the Gramian angular field (GAF), and the images were used as input to the CNN. Similar research can be found in [33], where three techniques for converting MTS data into images were tested: GAF, Gramian angular difference field (GADF), and Markov transition field (MTF). It was found that the different approaches for converting MTS into images do not affect classification performance, and that a simple CNN can outperform other approaches. In semiconductor manufacturing, it has been shown that MTS-CNN can successfully detect faulty wafers with high accuracy, recall, and precision [3].
Combining CNN, LSTM, and DNN has been another widely used approach over the years. In [60], the authors proposed a combined architecture abbreviated as CLDNN and applied it to large vocabulary tasks, where it outperformed the three individual algorithms. Another similar approach, named MDDNN, has been used to predict the class of a subsequence in terms of earliness and accuracy. An attention mechanism is incorporated into the deep learning framework in order to identify the critical segments related to model performance [61]. The proposed framework, as shown in Figure 2.3, uses both the time domain and the frequency domain through fast Fourier transformation and merges them together for prediction. Another similar study focused on early classification can be found in [28].
Another combined architecture corrects the mistakes of the LSTM and outperformed the other state of the art in human activity recognition.
2.3 Data Labeling
Classification is a supervised learning technique which needs labeled data. However, the dataset used in this thesis is not labeled, which is why data labeling was the primary concern before using any supervised learning algorithm. In the literature, very few works on time series data labeling can be found. The techniques researchers most often use are known as semi-supervised learning and active learning for univariate time series. In [64], the authors focused on active learning with positive unlabeled data. Their framework proposed a sample selection strategy to find the most informative samples for manual labeling. They introduced two active learning approaches which obtain a high-confidence training dataset for classification.
Another paper addresses the labeling issue and the relevance of self-labeling and semi-supervised learning techniques for time series classification. An empirical study was performed to compare self-labeled methods across various learning schemes and dissimilarity measures. The authors experimented with 35 different datasets with different percentages of labeled data in order to measure the transductive and inductive classification capabilities of the self-labeled data [65].
The semi-supervised learning approach has been used extensively in text classification, but in the time series domain it has not been used much. In [66], the authors made special considerations to adapt the well-known semi-supervised approach to the time series domain. Their approach was tested on diverse data sources such as electrocardiograms, handwritten documents, manufacturing, and video datasets. The results of the experiment showed that only a small amount of labeled data is needed when using the semi-supervised approach.
In this chapter, a brief overview of the current state of the art of MTS classification has been presented. In the next chapter, the methodology of this thesis, including data exploration, the necessary preprocessing, and the various algorithms tested on this dataset, will be discussed.
3 Methodology
3.1 Data Exploration and Preprocessing
As mentioned in the background section, there are twelve distinct temperature zones. The
temperature zones are:
Delivery Air Dewpoint (DAD)
Regen Temperature Active Setpoint (RTAS)
Regen Temperature Wheel Inlet (RTWI)
Regen Temperature Wheel Outlet (RTWO)
Hopper 1 Delivery Air Temperature (H1DAT)
Hopper 1 Hopper Outlet Temperature (H1HOT)
Hopper 1 Drying Monitor 1 Temperature (Bottom) (H1DM1T)
Hopper 1 Drying Monitor 2 Temperature (H1DM2T)
Hopper 1 Drying Monitor 3 Temperature (H1DM3T)
Hopper 1 Drying Monitor 4 Temperature (H1DM4T)
Hopper 1 Drying Monitor 5 Temperature (H1DM5T)
Hopper 1 Drying Monitor 6 Temperature (H1DM6T)
The raw data obtained from the machine is preprocessed to obtain the final data file, ignoring missing values and outliers in almost all cases. These twelve temperatures were measured using sensors over a period of one year (12 months) for this case study, although the obtained data file contains sensor readings for six months. The final dataset is prepared using a sampling interval of one minute. A chunk of the dataset is shown in Figure 3.1 and Figure 3.2.
The first time step in the data file is 1525150860000, which converts to the real date and time May 1, 2018 5:01:00 AM. So, the temperature readings start at May 1, 2018 5:01:00 AM and end at November 1, 2018 5:00:00 AM. As the sampling interval is one minute, the number of entries in the time series can be calculated in the following way:
Number of entries in the time series dataset:
19 (May 1) + 30*24 (May 2 – May 31) + 30*24 (June) + 31*24 (July) + 31*24 (August) + 30*24 (September) + 31*24 (October) + 5 (November) = 4,416 hours = 4,416 * 60 = 264,960 minutes.
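As a quick sanity check, the epoch-millisecond conversion and the minute count above can be reproduced in a few lines (a sketch assuming the timestamps are in UTC):

```python
from datetime import datetime, timezone

# First timestamp from the data file (epoch milliseconds) and the
# expected last reading, 264,960 one-minute readings later.
start_ms = 1525150860000
end_ms = start_ms + (264_960 - 1) * 60_000

start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
end = datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc)
print(start)  # 2018-05-01 05:01:00+00:00
print(end)    # 2018-11-01 05:00:00+00:00

# Number of one-minute readings from start to end, inclusive.
n_entries = int((end - start).total_seconds() // 60) + 1
print(n_entries)  # 264960
```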
But the data file contains 263,476 entries, which indicates that 264,960 − 263,476 = 1,484 minutes of data are missing. A detailed investigation of the temperature profiles reveals those missing values in the dataset. One such example is shown in Figure 3.3 and Figure 3.4.
If time steps are missing within an event, then the missing values will be imputed using the moving average of the sixty observations which are in the event (either before or after the missing time steps). Figure 3.5 shows such an example, where an event started at around 10:37 am and data are missing from 10:40 am to 12:24 pm. This gap cannot be imputed with the rolling average of the previous observations alone, as an event has already started.
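A minimal sketch of this imputation rule, assuming missing minutes are marked as `None` and using a hypothetical `impute_gap` helper (not the thesis's exact code):

```python
def impute_gap(values, window=60):
    """Fill None gaps with the mean of up to `window` observations,
    preferring values recorded after the gap (still inside the event);
    fall back to the values before the gap if none are available."""
    out = list(values)
    for i in range(len(out)):
        if out[i] is None:
            after = [v for v in out[i + 1:i + 1 + window] if v is not None]
            before = [v for v in out[max(0, i - window):i] if v is not None]
            pool = after if after else before
            if pool:
                out[i] = sum(pool) / len(pool)
    return out

# Two missing minutes inside an event, filled from the observations after them.
series = [100.0, 101.0, None, None, 103.0, 104.0]
filled = impute_gap(series)
print(filled)  # [100.0, 101.0, 103.5, 103.5, 103.0, 104.0]
```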
3.1.2 Data labeling
Classification is a supervised learning technique which requires labeled data to learn the intrinsic behavior of events. Supervised learning is then used to predict the class of any event, which is essential, for example, in fault diagnostics applications. For this case study, three major events were identified that occur regularly and have an impact on operations: the startup procedure, the cleaning cycle, and conveying issues [36], which are shown in Figure 3.6, Figure 3.7 and Figure 3.8.
Hopper 1 delivery air temperature sharply drops from 175° F to 125° F, then becomes
steady for around 15 minutes, then drops slowly to around 100° F.
Hopper 1 outlet temperature drops from around 150° F to 100° F within 10 minutes.
Return air temperature dry inlet shows a small drop from 115° F to 100° F, then becomes
steady again.
Delivery air dew point increases from around 10° F to around 40° F. This change can be
seen around 25-30 minutes after the other temperature drops occur.
Among the 12 temperatures, a certain amount of deviation can be observed in the above-mentioned 8 temperatures. The deviations for hopper 1 drying monitor temperatures 2 and 5, as well as return air temperature dry inlet, are minor compared to the other 5 temperatures, where a significant temperature drop can be observed. When defining the event, such variations need to be considered: the overall scenario may be the same with a little variation in the temperature drop. Another such event is shown in Figure 3.10.
The start and finish times of the events are first expanded into a column of millisecond timestamps matching the original time series data. Afterwards, this variable and the broken-down event markings are used to create a column of 1s and 0s that exactly matches the rows of the original dataframe. In short, the steps are:
Prepare a list of events with the start time and finish time.
Break the start and finish times of the events into a column of milliseconds where each
entry is 60,000 milliseconds or 1 minute apart from the next.
Convert this list into a column of 1s and 0s.
Add this column to the original time series data file.
More details of this procedure can be found in [69]. A portion of the final dataset with the
labelled column is shown in Figure 3.12.
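The steps above can be sketched with a hypothetical `build_label_column` helper operating on epoch-millisecond timestamps (an illustration of the procedure, not the exact code from [69]):

```python
def build_label_column(timestamps_ms, events):
    """Create a 0/1 label for each minute-level row.

    timestamps_ms: sorted epoch-millisecond timestamps of the raw rows.
    events: list of (start_ms, finish_ms) pairs; rows falling inside
    any event window are labeled 1, all others 0."""
    return [int(any(start <= t <= finish for start, finish in events))
            for t in timestamps_ms]

# Four one-minute rows; the middle two fall inside an event window.
ts = [0, 60_000, 120_000, 180_000]
labels = build_label_column(ts, events=[(60_000, 120_000)])
print(labels)  # [0, 1, 1, 0]
```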
So, instead of labeling one minute (one row) of MTS data as one example (either event or non-event), sixty minutes, or one hour, of MTS data (sixty rows) will be considered simultaneously as one example. Each sixty rows, or one hour of MTS data, will be considered as one subsequence. A subsequence is a piece extracted from a long sequence with a specific length; in this case the length of each subsequence is sixty minutes.
3.1.2.5.1 Sliding Window
The sliding window algorithm is a well-known technique for extracting subsequences from a long time series. Two parameters need to be defined before using a sliding window: the window length and the sliding step. As each example has a length of sixty minutes, the window length is taken as sixty. The sliding step is also taken as sixty because, after picking sixty minutes of data, moving to the next example requires moving sixty minutes forward. Then another subsequence with a length of sixty minutes is extracted from the long time series and labelled.
As mentioned earlier, the events are defined over an hour. When the dataset was labelled by each row, each minute was assigned a label. Now the goal is to assign one label per sixty minutes. So, using the sliding window algorithm, all the subsequences are extracted and a label is assigned to each. Since the examples of events and non-events are defined as an hour, in primary labeling all sixty rows (minutes) of each hour carry the same label. After subsequence extraction, the label of each hour is assigned according to the labels given to all sixty rows of that particular hour.
If an MTS T has a length of n, the window size is L, and the sliding step is p, the number of extracted subsequences m can be obtained using the following formula [32]:

m = ⌈(n − L + 1) / p⌉  or  m = (n − L) / p + 1

In our case, the length of the time series n = 264,960, the window length L = 60, and the sliding step p = 60. So, m = (264,960 − 60)/60 + 1 = 4,416.
So, using a window length of sixty and a sliding step of sixty, 4,416 subsequences can be extracted from this time series. The pseudocode for the extraction of subsequences and the labelling of the extracted subsequences is shown in Table 3.2.
Table 3.2: Sliding Window Algorithm
After labeling each subsequence, the dataset can be viewed as a three-dimensional dataset with dimensions N*L*M, where N represents the number of examples or subsequences, L represents the window length, and M represents the number of sensors or input variables of the MTS. Each subsequence has a dimension of L*M. In this case, L = 60 and M = 12, so each subsequence has 60*12 = 720 features of the MTS.
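The subsequence extraction and hour-level labeling can be sketched as follows (a simplified version assuming NumPy arrays and that all sixty rows of an hour share one label, as in the primary labeling):

```python
import numpy as np

def extract_subsequences(data, labels, L=60, p=60):
    """Slide a window of length L with step p over the minute-level
    data (shape n x M) and per-minute labels (length n).

    Returns X with shape (m, L, M) and y with one label per window;
    a window is labeled 1 when its rows carry label 1."""
    n = data.shape[0]
    m = (n - L) // p + 1
    X = np.stack([data[i * p:i * p + L] for i in range(m)])
    y = np.array([int(labels[i * p:i * p + L].max() == 1) for i in range(m)])
    return X, y

# Shape check with the numbers from this chapter:
# n = 264,960 minutes, M = 12 sensors -> m = 4,416 subsequences of 60 x 12.
data = np.zeros((264_960, 12))
labels = np.zeros(264_960, dtype=int)
X, y = extract_subsequences(data, labels)
print(X.shape)  # (4416, 60, 12)
```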
3.1.3 Characteristics of the labelled dataset
Detailed characteristics of the labelled dataset can be obtained by using the pandas profiling tool in Python. A basic summary of the labelled dataset is provided in Figure 3.14.
Figure 3.19: Common values and Extreme values of Minimum and Maximum of H1HOT
Figure 3.18 presents quantile statistics, such as percentiles, minimum, maximum, median, range, and interquartile range, as well as descriptive statistics like variance, standard deviation, mean absolute deviation (MAD), skewness, monotonicity, and so on. Figure 3.19 presents the common values as well as the minimum and maximum extreme values of hopper 1 hopper outlet temperature. Detailed characteristics of the other eleven variables can be inspected in a similar fashion. The frequency distributions of all twelve variables can be obtained from the histograms shown in Table 3.3.
3.1.3.2 Distribution of the input and output variables
It is clearly evident from the histograms that each is either skewed or bimodal in shape, which indicates a clear division in the temperature values of all twelve temperature zones. For example, H1DM1, H1DM2, H1DM3, H1DM4, H1DM5, H1DAT, RTWI, DAD, and RTAS are all clearly divided into two regions, which indicates the events and non-events. This is also reflected in the other histograms, like RTWO, H1HOT, and H1DM6, which are more bimodal in shape. When a failure event occurs, the temperature drops all of a sudden from its steady-state value. This phenomenon is clearly reflected in almost all temperature zones, which is why the approach of treating this problem as a binary classification problem is highly justified. The labeling was done such that the non-events, which are very high in number, are labelled as class 0, and the events, which are very low in number, are labelled as class 1.
The detail of the categorical label variable is shown in Figure 3.20. From the pie chart and histogram, it can be seen that only 19.1% of the data belong to class 1 (events), whereas 80.9% of the data belong to class 0 (non-events). This further matches the distributions shown in the histograms of the temperature zones. The value count for class 0 shown in Figure 3.20 is 214,500; as each complete example or subsequence of an event or non-event was previously defined as an hour, the number of subsequences belonging to class 0 is 214,500/60 = 3,575. On the other hand, the value count for class 1 is 50,460, so the number of examples or subsequences belonging to class 1 is 50,460/60 = 841. The total number of subsequences or examples (both events and non-events) is 3,575 + 841 = 4,416, which matches the number of subsequences previously obtained using the sliding window algorithm.
Table 3.3: Histograms of the temperature zones
DAD RTAS
RTWI RTWO
H1DAT H1HOT
H1DM1 H1DM2
H1DM3 H1DM4
H1DM5 H1DM6
Data normalization rescales each variable to the range from 0 to 1. In order to normalize the data, the maximum and minimum values of each variable need to be identified. Afterwards, a value is normalized as follows: Normalized X = (X − min)/(max − min). In Python, this task is done using the MinMaxScaler object from scikit-learn. In this case, after data normalization, the dataset looks as shown in Figure 3.22. It can be noticed that the output variable is not used for data normalization, as it is a categorical variable.
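A minimal sketch of the min-max formula above; scikit-learn's MinMaxScaler performs the same column-wise rescaling with its default settings:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of X to [0, 1] using (X - min) / (max - min),
    mirroring scikit-learn's MinMaxScaler with default settings
    (a sketch, not the thesis's exact code)."""
    mn = X.min(axis=0)
    mx = X.max(axis=0)
    return (X - mn) / (mx - mn)

# Two toy temperature columns.
temps = np.array([[100.0, 10.0],
                  [150.0, 25.0],
                  [175.0, 40.0]])
norm = min_max_normalize(temps)
print(norm)  # each column now spans 0.0 .. 1.0
```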
Random undersampling is simple and effective, but the issue is that examples are removed without any concern for how useful or important they might be in determining the decision boundary between the classes. This means it is possible, or even likely, that useful information will be deleted [71]. There are other undersampling techniques, which are beyond the scope of this thesis.
3.1.3.4.2 Oversampling
The simplest oversampling technique is duplicating the minority class randomly over and over until the minority class equals the majority class, making the dataset balanced. In this case, 791 training examples from the minority class are duplicated randomly to create 2,742 examples of the minority class, which are added to the training dataset. Figure 3.23 shows visuals of both undersampling and oversampling.
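The duplication step can be sketched as follows (an illustration with stand-in examples, not the thesis's exact code):

```python
import random

def random_oversample(minority, target_count, seed=0):
    """Duplicate minority-class examples at random until the class
    reaches target_count examples."""
    rng = random.Random(seed)
    out = list(minority)
    while len(out) < target_count:
        out.append(rng.choice(minority))
    return out

# With the counts used in this chapter: 791 minority examples
# oversampled to match the 2,742 majority examples.
minority = list(range(791))  # stand-in minority examples
balanced = random_oversample(minority, 2742)
print(len(balanced))  # 2742
```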
dA^[L−1] = ∂J/∂A^[L−1] = (W^[L])ᵀ dZ^[L]   (7)
After obtaining the derivatives, parameter updates are performed in the following way:
The input has a shape of (n*m), where n is the number of features in each example and m is the number of examples, which are 720 and 3,533 respectively in this case. A typical MLP with one hidden layer is shown in Figure 3.28.
Figure 3.29: Multi channel Deep CNN application on time series [32]
Another approach is, rather than processing each variable separately, to process the twelve variables with a subsequence length of sixty simultaneously and flatten them after extracting meaningful features through the convolutional and pooling layers.
3.2.2.1 Convolutional Layer
A CNN mainly consists of three layers: the convolutional layer, the pooling layer, and the fully connected layer. In the convolutional layer, the input feature matrix is convolved with a filter. The filter is initially defined randomly and is trained during the training process to obtain the desired values with the help of a cost function. There are two filter parameters: the number of filters and the filter size or kernel size. Typical kernel sizes for a 1D convolution operation in time series applications are 3, 5, 7, and so on, depending on the length of the input subsequence.
Stride and padding are the two most important parameters to be decided during the convolution operation. Stride defines how many units the filter shifts during the convolution operation over the input feature vector. A common scenario in the convolution operation is that the input feature vector size shrinks continuously over many layers of convolution. The problem is that when the size of the input matrix is reduced, many important features might be lost. To overcome this, a padding operation is performed so that, even with a filter smaller than the input vector, the output keeps the same size as the input; this helps the feature vector shrink more slowly over the layers of the neural network. So, two types of convolution operations exist: one is called valid, where no padding is performed, so the input feature vector shape decreases with the convolution operation; the other is called same, where padding is performed so that the output has the same length as the input, which is not very common in time series analysis. A typical example of the convolution operation, which is essentially a sum-product operation, is shown below.
For example, if the input subsequence is [5, 4, 9, 2, 7, 6] and the filter has a kernel size of 3 with values [2, 3, 1], the output subsequence will be [5*2+4*3+9*1, 4*2+9*3+2*1, 9*2+2*3+7*1, 2*2+7*3+6*1] = [31, 37, 31, 31].
The length of the output subsequence is obtained using the following formula:

n[L] = ⌊(n[L−1] − f[L]) / s[l] + 1⌋   (11)
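The worked example above can be reproduced with a short helper (a sketch of the valid, stride-1 sum-product operation; `conv1d_valid` is a hypothetical name):

```python
def conv1d_valid(x, w, stride=1):
    """1D convolution with no padding and no filter flip, matching
    the worked example above; output length follows formula (11)."""
    f = len(w)
    n_out = (len(x) - f) // stride + 1
    return [sum(x[i * stride + j] * w[j] for j in range(f))
            for i in range(n_out)]

print(conv1d_valid([5, 4, 9, 2, 7, 6], [2, 3, 1]))  # [31, 37, 31, 31]
```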
The first architecture, as shown in Figure 3.31, is the traditional feed-forward neural network, known as the one-to-one architecture, with fixed input and output lengths.
The second architecture is one-to-many, where the output can be of variable length. The common example of this architecture is image captioning, where the input is an image and the output is a text describing the image.
The third architecture, referred to as many-to-one, is used for sentiment classification or time series analysis like forecasting and classification. In this case, the many-to-one architecture will be used, where the input is a subsequence and the output is the class of that subsequence.
The last two architectures, known as many-to-many, are commonly used for language translation, as in Google Translate, where the input and output can both be of variable lengths. For time series analysis, this architecture can also be used to forecast a time series, for example forecasting the stock prices of the next seven days at a time.
A general computational graph of RNN is shown in Figure 3.32.
Backpropagation through time (BPTT) is performed at every single time step T; the basic equation for BPTT is shown below [88]. More details on BPTT can be found in [84], where each step of the gradient calculation and weight update is given.

∂L(T)/∂W = Σ_{t=1}^{T} ∂L(T)/∂W(t)   (15)
Computing the gradient with respect to h0 involves many factors of Whh, and consequently there is a long-term dependency when handling the repeated gradient computation back in time. The two common problems faced by RNNs are vanishing and exploding gradients.
When the gradient becomes too small, updating the parameters does not add any significant information to the process, as there is no real update of the weights. So, over many layers with a long sequence, the RNN suffers from long-term dependency. This situation of insignificant weight updates due to very small gradient values is known as the vanishing gradient problem.
The second problem, known as the exploding gradient, happens when the gradient becomes too large through exponential growth, so that the resulting weights after the update become too large.
Forget gate: The forget gate decides which part of the previous information will be kept or discarded. The basic equation is:
ft = σ(wif xt + bif + whf ht−1 + bhf)   (16)
where xt = input timestamp at time t, wif = weight matrix for the input, ht−1 = hidden state at time t−1, whf = weight matrix for the hidden state; bif and bhf are the bias terms associated with the input and hidden state respectively. The sigmoid function, as shown in Figure 3.27, is applied after the linear combination of weights and biases, which converts the value of ft to between 0 and 1. If ft = 0, the network will forget the information, and if ft = 1, the network will remember the information for further processing.
Input gate and New Information Processing: The input gate is used to quantify the
information obtained from the newest input [90]. The basic equation is provided below.
𝑖𝑡 = 𝜎(𝑤𝑖𝑖 𝑥𝑡 + 𝑏𝑖𝑖 + 𝑤ℎ𝑖 ℎ𝑡−1 + 𝑏ℎ𝑖 ) (17)
where wii = weight matrix for the input, whi = weight matrix for the hidden state; bii and bhi are the bias terms associated with the input and hidden state respectively for the input gate. A sigmoid function is used in this gate as well, converting the value of it to between 0 and 1.
Afterwards, the new information which will be passed to the cell state needs to be processed.
The basic equation is shown below.
𝑔𝑡 = 𝑡𝑎𝑛ℎ(𝑤𝑖𝑔 𝑥𝑡 + 𝑏𝑖𝑔 + 𝑤ℎ𝑔 ℎ𝑡−1 + 𝑏ℎ𝑔 ) (18)
𝑐𝑡 = 𝑓𝑡 𝑐𝑡−1 + 𝑖𝑡 𝑔𝑡 (19)
where wig = weight matrix for the input, whg = weight matrix for the hidden state; big and bhg are the bias terms associated with the input and hidden state respectively for the update of new information, and ct−1 = cell state at timestamp t−1. Here the tanh activation function, as shown in Figure 3.27, is used because it converts the value of gt to between −1 and 1. If the value of gt is negative, the new information is subtracted from the previous cell state, and if it is positive, it is added to the cell state.
Output gate: The basic equations for the output gate are shown below.
ot = σ(wio xt + bio + who ht−1 + bho)   (20)
ht = ot * tanh(ct)   (21)
The output gate uses a sigmoid function like the previous gates, turning the output into a value between 0 and 1. Afterwards, the current hidden state is calculated from the output and the cell state at time t with a tanh activation function, which indicates that the hidden state is a function of the long-term memory ct and the current output.
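The gate equations (16)–(21) can be sketched for a scalar toy case as follows (purely illustrative weights; a real LSTM uses learned weight matrices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM step for scalar input and state, following equations
    (16)-(21); w holds scalar weights/biases (a toy illustration of
    the gate interactions, not a real trained network)."""
    f_t = sigmoid(w["wif"] * x_t + w["bif"] + w["whf"] * h_prev + w["bhf"])    # forget gate (16)
    i_t = sigmoid(w["wii"] * x_t + w["bii"] + w["whi"] * h_prev + w["bhi"])    # input gate (17)
    g_t = math.tanh(w["wig"] * x_t + w["big"] + w["whg"] * h_prev + w["bhg"])  # new info (18)
    c_t = f_t * c_prev + i_t * g_t                                             # cell state (19)
    o_t = sigmoid(w["wio"] * x_t + w["bio"] + w["who"] * h_prev + w["bho"])    # output gate (20)
    h_t = o_t * math.tanh(c_t)                                                 # hidden state (21)
    return h_t, c_t

# Illustrative weights: every weight and bias set to 0.5.
names = ["wif", "bif", "whf", "bhf", "wii", "bii", "whi", "bhi",
         "wig", "big", "whg", "bhg", "wio", "bio", "who", "bho"]
w = {name: 0.5 for name in names}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, w=w)
print(h, c)  # h stays in (-1, 1) because of the tanh/sigmoid bounds
```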
In this thesis, LSTM will be used to capture the time dependency of the events occurring in the drying hopper. For a long sequence, the LSTM figures out which information is necessary and uses that information to maintain the time dependency. In addition, as an improvement over the RNN, the LSTM has high capturing power in terms of temporal information. However, it does not really capture internal features such as spatial features, which are mainly handled by the CNN. So, in the next section, the combination of LSTM and CNN will be explored.
3.2.4 Combination of CNN and LSTM
The combination of CNN and LSTM, commonly known as CNN LSTM, is mainly used for capturing the internal features of an input, like a time series or a sequence of images, through the CNN layer, while the LSTM layer performs sequential learning simultaneously. The architecture also includes a deep neural network like an MLP after the CNN and LSTM layers. The architecture of the CNN LSTM is shown in Figure 3.36 [91].
CNN LSTM works well on datasets which have a 2D structure, like the pixels in an image, or a 1D structure, like the words in a sentence, where the input or output (or both) has a temporal structure. In this case, the drying hopper temperature profile has both spatial and temporal features. The spatial features are, for example, the peak temperature value of an event or the specific pattern observed in an event. Moreover, as the drying hopper temperature values are recorded over time, any event or non-event has a temporal dependency. For example, this dataset is processed in such a way that each event or non-event is one hour long, and each time step in an event is temporally dependent on the previous time step. So, theoretically it makes more sense to use the CNN LSTM model. In simple block diagrams, the architecture looks as follows.
A CNN LSTM model takes the input subsequences as blocks, extracts features from each block, and then uses the LSTM on the flattened extracted features to identify its own features before a final mapping onto each class is made [93]. In this case, the input subsequence of length 60 will be divided into four subsequences, each of which has a length of 15 minutes.
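The reshaping into four 15-minute blocks can be sketched as follows (a shape check only; in Keras this block structure is typically consumed by a TimeDistributed CNN wrapper ahead of the LSTM layer):

```python
import numpy as np

# Reshape 60-minute subsequences (60 x 12) into four 15-minute blocks
# (4 x 15 x 12): the CNN extracts features per block, then the LSTM
# reads the four blocks in order.
n_examples, window, n_sensors = 8, 60, 12
X = np.zeros((n_examples, window, n_sensors))

n_blocks, block_len = 4, 15
X_blocks = X.reshape(n_examples, n_blocks, block_len, n_sensors)
print(X_blocks.shape)  # (8, 4, 15, 12)
```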
3.2.5 Machine Learning Algorithms
In this thesis, several machine learning algorithms will be used along with the deep learning techniques to perform a comprehensive evaluation of these algorithms on the drying hopper case study dataset. Both nonlinear algorithms, like k-nearest neighbors, classification and regression trees, SVM, and Naïve Bayes, as well as ensemble algorithms, like bagged decision trees, random forest, extra trees, and gradient boosting, will be used, and their performances will be compared to those of the deep learning algorithms and traditional methods like dynamic time warping with k-nearest neighbors. In the next two subsections, brief backgrounds of SVM and random forest are provided as an overview of one nonlinear method and one ensemble method.
3.2.5.1 Support Vector Machine (SVM)
SVM is a nonlinear machine learning algorithm used for supervised learning problems like classification and regression, though mostly for classification. The main idea is that the data points of all classes are plotted in an n-dimensional space, where n is the number of features. The final goal is to obtain the hyperplane that separates the classes. A simple example of SVM is shown in Figure 3.37, where three hyperplanes are shown, among which the optimum hyperplane, the one that best separates the two classes, will be chosen.
Precision:
Although accuracy is a performance measure for the overall dataset, precision is a performance measure for an individual class. Precision is related to the predictions: it is defined as the ratio between correctly predicted positive-class examples and all predicted positive-class examples (the definition can also be given in terms of the negative class). It measures what percentage of the examples predicted as a class actually belong to that class.
Precision = TP / (TP + FP)   (23)
Recall:
Another highly useful performance measure is recall, also known as sensitivity, which is related to the truth instead of the prediction. It is the ratio between the correctly predicted positive examples and all examples which are actually positive, including those which are predicted as negative but are actually positive. This definition can be extended to the negative-class perspective as well. It measures what percentage of the examples from a class are correctly identified in terms of the actual labels.
Recall = TP / (TP + FN)   (24)
F1 score:
Probably the best performance measure is the F1 score, which accounts for the data imbalance issue. It is a more structured performance measure combining precision and recall:

F1 = 2 * precision * recall / (precision + recall)   (25)
Another useful graphical technique to visualize the model's performance is the confusion matrix. A confusion matrix is a matrix whose number of dimensions equals the number of classes. A typical confusion matrix for two classes is shown in Figure 3.42.
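Equations (23)–(25) can be computed directly from confusion-matrix counts (the counts below are illustrative, not results from this thesis):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix
    counts, per equations (23)-(25)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```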
In the next chapter, the detail of the experimentation on the models described in this chapter,
result and evaluation will be highlighted. Afterwards, the discussion regarding the result and
limitation of the thesis will be presented.
Figure 4.1: Confusion matrix and classification report of imbalanced dataset (CNN)
Figure 4.1 shows a very high accuracy for the model, which can fool anyone who is not aware of the distribution of the two classes in the test set. As shown in Table 4.1, 93.43% of the test examples belong to class 0. This is why the deep learning model showed an accuracy of 93.43%: it could not identify any events of class 1. All the events of class 1 were identified as class 0, which resulted in the test accuracy of 93.43%. The performance of this CNN model can be understood more clearly from the confusion matrix and the classification report with precision, recall, and F1 score, as shown in Figure 4.1.
As a remedy to this issue, ensemble learning, oversampling and SMOTE have been used. For
ensemble learning, three approaches have been taken to evaluate the effectiveness in terms of
precision and recall.
Approach 1: The 2,742 training examples of class 0 are divided into five groups. As 2,742 is not divisible by 5 (2,742/5 = 548.4), the first three groups have 548 examples each and the other two groups have 549 examples each; together the five groups have 548*3 + 549*2 = 2,742 examples of class 0. Afterwards, 548 or 549 examples of class 1 are chosen randomly, combined with the class 0 examples of each group, and shuffled to obtain one training set. In this way, five training sets are generated; each of them is trained with a separate model, and majority voting is used at the end in the following way.
n1 = Σ_{i=1}^{5} ŷi   (26)

ŷ = { 1 if n1 > 2; 0 otherwise }   (27)
Here, n1 is the sum of the outputs of the five undersampling models for any particular test example; it is at most five, if all models classified that example as 1, and at minimum 0, if all models classified it as 0. In other words, n1 is the number of models which classified that particular example as class 1.
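The majority voting of equations (26)–(27), and likewise (28)–(29) for three models, can be sketched as:

```python
def majority_vote(predictions):
    """Majority voting over per-model 0/1 predictions for one test
    example: predict 1 when more than half of the models predict 1
    (n1 > 2 for five models, n1 > 1 for three models)."""
    n1 = sum(predictions)
    return 1 if n1 > len(predictions) // 2 else 0

print(majority_vote([1, 0, 1, 1, 0]))  # 3 of 5 models say 1 -> 1
print(majority_vote([0, 0, 1, 0, 1]))  # only 2 of 5 say 1 -> 0
```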
Approach 2: The 2,742 training examples of class 0 are divided into three groups. The first two groups have 791 examples each and the last group has 2,742 − 791*2 = 1,160 examples. Afterwards, the 791 examples of class 1 are combined with each group to build three datasets. Each group is shuffled properly before training. In this way, three training sets are generated, each trained with a separate model, and majority voting is used in the following way.
n1 = Σ_{i=1}^{3} ŷi   (28)

ŷ = { 1 if n1 > 1; 0 otherwise }   (29)
Approach 3: This approach is similar to the previous one, except that the 2,742 class-0
examples are divided into three equal segments of 914 examples each. The 791 class-1
examples are then combined and shuffled with each of the three groups, so each training set
has 914 (class 0) + 791 (class 1) = 1,705 examples. Majority voting is applied exactly as in
approach 2.
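The construction of the undersampled training sets in approach 3 can be sketched as follows, using index arrays as stand-ins for the real windowed examples (the arrays and the class-1 offset are illustrative, not part of the thesis code).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real training examples: 2,742 class-0 and 791 class-1 windows
X0 = np.arange(2742)            # indices of the class-0 examples
X1 = np.arange(791)             # indices of the class-1 examples

# Approach 3: split class 0 into three equal segments of 914 examples each
segments = np.split(X0, 3)

training_sets = []
for seg in segments:
    # every training set gets one class-0 segment plus ALL class-1 examples;
    # the +10000 offset only marks the class-1 rows in this toy illustration
    combined = np.concatenate([seg, X1 + 10000])
    rng.shuffle(combined)       # shuffle before training
    training_sets.append(combined)

print([len(s) for s in training_sets])  # -> [1705, 1705, 1705]
```

Each of the three sets would then be fed to its own model, and the predictions combined with the majority-voting rule of Equations (28) and (29).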
4.2 Hyperparameter Tuning
Hyperparameter tuning is an essential step when applying deep learning algorithms. A deep
learning model has many hyperparameters, and a suitable combination must be selected for
optimal performance. Table 4.2 lists the parameters and hyperparameters associated with
CNN, LSTM and MLP.
Table 4.2: List of hyperparameters
As mentioned in the previous chapter, the Keras deep learning framework in Python, running
on top of TensorFlow, is used for the experimentation. Apart from the hyperparameters shown
in Table 4.2, there are others as well. One of the most important hyperparameters in any deep
learning algorithm is the learning rate, α; the Keras default of α = 0.01 with no momentum
was used for the initial test run. The first group of samples from approach 2 was taken for the
hyperparameter tuning of CNN and LSTM. The following tables summarize the
hyperparameters for the initial tuning experiment.
Table 4.3: values considered for hyperparameters
Different values for the number of filters, filter size and batch size were used as shown in
Table 4.3, while the other hyperparameters were kept constant as shown in Table 4.4. Ten
experimental runs were performed to find the best set of hyperparameters. Because neural
network training is highly stochastic due to random weight initialization, each run yields a
different test accuracy, so both the average and the variability of test accuracy were taken into
consideration when choosing a hyperparameter value. Figure 4.2 shows box plots of test
accuracy for the number of filters, filter size and batch size.
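The selection criterion, the mean test accuracy together with its spread over repeated runs, can be sketched as follows; the accuracy values below are illustrative placeholders, not the thesis measurements.

```python
import statistics

# Hypothetical test accuracies from 10 runs for two candidate filter counts
runs = {
    16: [0.983, 0.987, 0.985, 0.990, 0.984, 0.986, 0.988, 0.985, 0.987, 0.986],
    32: [0.989, 0.991, 0.984, 0.992, 0.990, 0.988, 0.993, 0.989, 0.991, 0.990],
}

# A candidate wins when its mean is higher without a much larger spread;
# these two numbers are what the box plots in Figure 4.2 visualize
for n_filters, accs in runs.items():
    print(n_filters,
          round(statistics.mean(accs), 4),   # average test accuracy
          round(statistics.stdev(accs), 4))  # run-to-run variability
```

With real data, the dictionary would hold one list of 10 accuracies per tested value of each hyperparameter.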
From Figure 4.3, it is evident that the model converges toward the optimal solution, so these
hyperparameters will be used in the final experiment, except for the number of epochs.
Although the accuracy is good, Figure 4.3 also shows that the model overfits: the learning
curve fluctuates continuously, indicating that the model has learned too much and suffers from
generalization error [102]. The figure shows that within 10 epochs the model reaches almost
zero loss and very high accuracy, which is why 10 epochs will be used in the final experiment.
4.3 Results
After performing the hyperparameter search, final experimental runs were performed. The
following section provides a summary of the results for all of the considered algorithms and
corresponding data balancing techniques.
4.3.1 Ensemble Learning (CNN)
In the final experiment, 10 runs were performed for each of the deep learning and machine
learning algorithms. For this specific drying hopper case, the main goal is to capture the events
automatically so that a predictive maintenance action can be taken as a remedy. Because
non-events vastly outnumber events, a model that identifies the non-events correctly will
achieve high accuracy regardless of whether it identifies any events at all. This is why, instead
of accuracy, the precision of class 0 and the recall of class 1 are more important in this case:
both depend on how many events are wrongly identified as non-events (false positives with
respect to class 0). For this case study, it is desirable to have the precision of class 0 and the
recall of class 1 as high as possible, that is, the number of wrongly identified events as low as
possible. Figure 4.4 shows the CNN framework used in this thesis.
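Taking class 1 as the positive (event) class, the two quantities of interest can be computed directly from the confusion-matrix counts; the counts below are hypothetical, chosen only to show the calculation.

```python
def class_metrics(tn, fp, fn, tp):
    """Precision of class 0 and recall of class 1 from a binary confusion
    matrix, with class 1 taken as the positive (event) class. Both quantities
    drop as missed events (events predicted as class 0) grow, which is why
    they matter more than plain accuracy on this imbalanced problem."""
    precision_0 = tn / (tn + fn)  # of all 'non-event' predictions, fraction correct
    recall_1 = tp / (tp + fn)     # of all true events, fraction caught
    return precision_0, recall_1

# Hypothetical counts: 818 true non-events kept, 1 false alarm,
# 4 missed events, 60 events caught
p0, r1 = class_metrics(tn=818, fp=1, fn=4, tp=60)
print(round(p0, 4), round(r1, 4))  # -> 0.9951 0.9375
```

Note that accuracy here would be (818 + 60) / 883 ≈ 0.994 even though 4 of the 64 events were missed, which illustrates why accuracy alone is not a sufficient criterion.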
Segment 2 of approach 3, in which the 914 class-0 training examples (rows 915 to 1828) were
combined with the 791 class-1 training examples, obtained the best result in four of the
experimental runs. This clearly indicates that these 914 class-0 examples provide significant
information for training a CNN model. If undersampling were used without ensemble learning,
there would be uncertainty about the significance of the specific data segment used for
training. Figure 4.8 shows the best learning curves obtained from the best undersampling
strategies.
Segment 3 of approach 2, in which the 1,160 class-0 training examples (rows 1583 to 2742)
were combined with the 791 class-1 training examples, obtained the best result in four
experimental runs, indicating that these 1,160 class-0 examples provide significant information
for training an LSTM model. Figure 4.13 shows the best learning curves obtained from the
best undersampling strategies of ensemble learning with LSTM.
Figure 4.15 provides a summary of the ten experimental runs using CNN-LSTM.
Figure 4.16: Confusion matrix and classification report of appr. 1, run 4 and appr. 2, run 1
The best undersamples of the ensemble learning in each run are shown in Figure 4.17.
Approach 3, segment 2 appears twice in the list, along with approach 1, segment 5; approach 2,
segment 2; and approach 2, segment 3.
Figure 4.18 shows the best learning curves obtained from the best undersampling strategies of
ensemble learning with CNN-LSTM.
Method: Ensemble Learning. Average accuracy over 10 runs:
  Approach 1: CNN 0.9866, LSTM 0.9874, CNN-LSTM 0.9875
  Approach 2: CNN 0.9900, LSTM 0.9855, CNN-LSTM 0.9826
  Approach 3: CNN 0.9930, LSTM 0.9905, CNN-LSTM 0.9807
Method: SMOTE. Average accuracy over 10 runs:
  CNN 0.9942, LSTM 0.9830, CNN-LSTM 0.9900, K-nearest neighbor 0.9813,
  Support Vector Machine 0.9766, Decision Trees 0.9703, Random Forest 0.9643,
  Gradient Boosting 0.9601, Naïve Bayes 0.9732

Average accuracy over 10 runs (machine learning algorithms):
  K-nearest neighbor 0.9742, Support Vector Machine 0.9789, Decision Trees 0.9692,
  Random Forest 0.9621, Gradient Boosting 0.9735, Naïve Bayes 0.9684
SMOTE: Naïve Bayes: 817, 48, 2, 16 (accuracy 0.9796)
Original dataset (no data balancing techniques): Support Vector Machine: 818, 50, 0, 15 (accuracy 0.9830)
It is evident from the summary above that CNN works best, both in terms of average result
and best result over the ten experimental runs. Within ensemble learning, approach 3 with
CNN shows the best accuracy, and with SMOTE, CNN works best among all algorithms.
4.4 Discussion
This section highlights the major limitations of this thesis, along with an interpretation of the
results from a manufacturing perspective.
4.4.1 Event definition and subsequence extraction
The purpose of this thesis was to automatically detect unusual events occurring in an
industrial drying hopper installed on the manufacturing shop floor of a polymer manufacturer.
As the raw dataset obtained from the machine interface was not structured for use in ML or
DL algorithms, it required preprocessing by a domain expert. Even after this primary
preprocessing, the dataset still had missing values and no labels. The major hindrance to using
the dataset for event detection was the lack of an accurate definition of an unusual event by
which the temperature profile could be visually divided into classes. The events therefore
needed to be defined at the very beginning of the experiment, considering all variations of the
unusual events. Several assumptions had to be made to maintain consistency in the definition
of an event: for example, a small peak in a temperature value lasting three or four minutes is
not defined as an event. There were two limitations in defining an event. First, no physics-
based model or mechanics of the drying hopper was available from which the events and their
temperature-profile structure could be understood. Second, among all scenarios that could be
identified as events, there are too many variations that are potential candidates. For example,
the hopper 1 outlet temperature varies around 150 °F; at some time steps this value drops
below 100 °F while all other sensor readings are normal. Whether this should be identified as
an event or a non-event is unclear, and this type of ambiguity was present in almost all cases
where a temperature drops or rises suddenly. Events were therefore defined based on visual
inspection of the temperature profile: any unusual, lasting phenomenon significantly different
from the steady-state condition was listed as an event.
As mentioned in 3.1.2, the previous study on this particular drying hopper case [36] identifies
three different types of unusual event: dryer undersizing, conveying issues and cleaning
cycles. In the temperature profile, however, distinguishing these individual events was not
straightforward because of their variation, which is why the goal of this thesis was to identify
any unusual event rather than the type of event. Another simplifying assumption had to be
made concerning the start and end times of an event. The temperature profiles were segmented
on an hourly basis, with 60 entries or time steps per hour, so if something unusual started
within an hour, that entire hour was labeled as an event regardless of the actual start time.
The window approach used in the data preprocessing step for extracting subsequences, and for
labeling each subsequence from the row labels, is also based on some assumptions. For ease of
labeling, a window length of sixty minutes and a sliding step of sixty minutes were used,
yielding 4,416 examples or subsequences. Since each row was assigned a label and events or
non-events were defined hourly, all sixty rows within an hour share the same label. If the start
and end times were defined more precisely, instead of on an hourly basis, then other
windowing schemes with sliding steps other than sixty (or multiples of sixty) could be used.
For example, an event starts at 1:26 pm, so the hour from 1:01 pm to 2:00 pm was labeled as
an event and every row in it received label 1. It could instead have been labeled with the 25
rows from 1:01 pm to 1:25 pm as 0 and the remaining 35 rows from 1:26 pm to 2:00 pm as 1,
since the event started at 1:26 pm. In that case any sliding step could be used, but labeling
each subsequence from the labeled rows would become more complex, because not all rows in
a subsequence would then share the same label. For example, with a window length of sixty
minutes but a sliding step of 1, the first subsequence starts at 5:01 am, May 1, 2018 and ends
at 6:00 am, May 1, 2018, while the second starts at 5:02 am and ends at 6:01 am; with a sixty-
minute sliding step, the second subsequence instead spans 6:01 am to 7:00 am. Moving the
window with a step of 1 minute will at some point extract a subsequence from 12:39 pm to
1:38 pm. As the event started precisely at 1:26 pm, the rows from 1:26 pm to 1:38 pm are
labeled 1 but the rows from 12:39 pm to 1:25 pm are labeled 0, so this subsequence contains
rows of both labels and is difficult to define as an event or non-event. Defining an event as any
hour in which at least one row is labeled 1 can be misleading: if at least, say, 30% of the rows
in an hour are labeled 1, calling that hour an event is reasonable, but if only a single row
carries label 1 while the rest carry label 0, it is not. The issue is determining the minimum
fraction of rows that must be labeled 1 for an hour to count as an event. If this problem could
be handled, the number of examples would be much higher than at present: with a sliding step
of 1 and a window length of 60, the number of extracted subsequences would be
m = (264,960 - 60)/1 + 1 = 264,901 according to the formula in 3.1.2.5.1, roughly sixty times
the current number of examples. As this thesis relies entirely on data-driven modeling, with
little input about the physics of the events and non-events and no clear event definition, the
simplest way to define and visualize the events was chosen for classification.
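The subsequence-count formula from 3.1.2.5.1 and a fraction-based window-labeling rule of the kind discussed above can both be sketched briefly; the 30% threshold is the illustrative value from the text, not a tuned one.

```python
def num_subsequences(n_rows, window, step):
    """m = (N - L) / s + 1: number of windows of length L over N rows with stride s."""
    return (n_rows - window) // step + 1

# The two windowing schemes discussed in the text
print(num_subsequences(264960, 60, 60))  # -> 4416 (hourly, non-overlapping)
print(num_subsequences(264960, 60, 1))   # -> 264901 (sliding step of 1 minute)

def label_window(row_labels, threshold=0.3):
    """Label a window as an event (1) when at least `threshold` of its rows
    are event rows; otherwise label it a non-event (0)."""
    return int(sum(row_labels) / len(row_labels) >= threshold)

print(label_window([0] * 40 + [1] * 20))  # -> 1 (20/60 of the rows are event rows)
print(label_window([0] * 59 + [1]))       # -> 0 (a single event row is not enough)
```

Such a rule would make mixed-label subsequences usable, at the cost of having to justify the chosen threshold.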
4.4.2 Data imbalance issue
As shown before, the training set had 2,742 examples of class 0 (77.61%) and 791 examples
of class 1 (22.39%), so the dataset is imbalanced. During the first experimental trial using a
simple neural network, all test examples were identified as class 0, as shown in Figure 4.1.
From the literature review of scientific journals and from data analytics blogs such as
"towardsdatascience" [103] and "analyticsvidhya" [104], the problem was identified as an
imbalanced classification issue. CNN, LSTM and CNN-LSTM were then applied to the same
imbalanced dataset but produced the same behavior as the simple neural network, whereas
machine learning algorithms such as SVM and KNN worked reasonably well on the
imbalanced data. For the deep learning algorithms, imbalance handling techniques, namely
ensemble learning with undersampling and SMOTE as an oversampling technique, were
therefore used, and the results turned out very reasonable. For the machine learning
algorithms, SMOTE was also applied in addition to the regular dataset, as a data augmentation
step to check their performance. Undersampling was used by combining all undersampled
models into an ensemble learner. Without combining all undersamples, the performance of an
algorithm cannot be assessed properly from a single undersampling run: if only one
undersample from approach 3 were used for training without an ensemble, there would be no
certainty that the algorithm would converge to the optimum without overfitting and perform
well on the validation and test sets, as shown in Figure 4.23. Even if one undersample
performs well on new data, uncertainty remains because of the stochastic nature of deep
learning: that particular undersample might work well in one run, but there is no guarantee it
will work well in all experimental runs. Ensemble learning with undersampling is powerful
precisely because it combines the outputs of the undersamples through majority voting: even if
one undersample out of three performs badly on the new dataset, the majority vote corrects for
it. The approach only fails when the majority of the undersamples
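The core interpolation step of SMOTE [74] can be sketched in plain NumPy; this is a bare-bones illustration of the idea (synthesize minority examples between a point and one of its nearest minority neighbours), not the implementation used in the thesis, and the input array is a random placeholder.

```python
import numpy as np

def smote_sample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority examples by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from point i to every minority point
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # random position along the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Placeholder minority data: 20 flattened windows with 12 sensor features each
X1 = np.random.default_rng(1).random((20, 12))
X_new = smote_sample(X1, n_new=40)
print(X_new.shape)  # -> (40, 12)
```

Because each synthetic point lies on a segment between two real minority points, the new examples stay inside the range of the observed minority data rather than duplicating rows as plain random oversampling would.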
average accuracy across 10 runs. This is why the best method among the ML algorithms using
SMOTE is chosen as gradient boosting rather than k-nearest neighbor, which identifies more
events wrongly.
Table 4.5 summarizes the best results in terms of average accuracy over 10 runs. The best
average accuracy, 99.42%, was found for CNN with SMOTE. The FP and FN values are also
consistent across the ten runs with CNN, with only one outlier in which 18 non-events were
classified as events (FN). Apart from that instance, the number of FP values varies between 0
and 3 and the number of FN values between 0 and 6. The second best model was also CNN,
with ensemble learning approach 3, at 99.30% average accuracy. Its FP values are highly
consistent across the 10 runs, varying between 1 and 2, but it has two outliers in the FN
values, 11 and 19; otherwise the number of FN varies between 0 and 7, which is also higher
than CNN with SMOTE.
Table 4.6 summarizes the best result obtained over the 10 runs in terms of accuracy. Again,
the best results were found for CNN with ensemble learning approach 3 and CNN with
SMOTE. Both achieved 99.89% accuracy with only one FP value and no FN values. CNN is
therefore clearly the best algorithm for classifying this dataset into two categories, with both
ensemble learning and SMOTE.
The performance of LSTM and CNN-LSTM is also good and not much lower than that of
CNN, but CNN shows not only high accuracy but also consistently fewer FP and FN values
across the ten runs. The ML algorithms perform relatively worse than the deep learning
approaches. The maximum average accuracy across 10 runs is 98.13%, achieved by KNN
when no imbalanced data handling technique is used; the second best was SVM with 97.89%
average accuracy. In terms of best accuracy among the ten runs, KNN achieves 99.32% (no
data imbalance technique) but wrongly classifies 6 events. The next best are KNN, SVM and
GB (with SMOTE) at 98.64% accuracy. KNN again suffers from high FP values, wrongly
classifying 12 events, whereas SVM and GB each wrongly classify only 1 event and 11
non-events. Among the ML algorithms, SVM is therefore selected as the best.
References
[1] “What is Artificial Intelligence? How Does AI Work? | Built In.”
https://builtin.com/artificial-intelligence (accessed May 27, 2021).
[2] E. Oztemel and S. Gursev, “Literature review of Industry 4.0 and related technologies,”
J. Intell. Manuf., vol. 31, no. 1, pp. 127–182, 2020, doi: 10.1007/s10845-018-1433-8.
[3] C. Y. Hsu and W. C. Liu, “Multiple time-series convolutional neural network for fault
detection and diagnosis and empirical study in semiconductor manufacturing,” J. Intell.
Manuf., vol. 32, no. 3, pp. 823–836, 2021, doi: 10.1007/s10845-020-01591-0.
[4] S. S. Jones et al., “A multivariate time series approach to modeling and forecasting
demand in the emergency department,” J. Biomed. Inform., vol. 42, no. 1, pp. 123–139,
Feb. 2009, doi: 10.1016/j.jbi.2008.05.003.
[5] Z. Du, W. R. Lawrence, W. Zhang, D. Zhang, S. Yu, and Y. Hao, “Interactions between
climate factors and air pollution on daily HFMD cases: A time series study in
Guangdong, China,” Sci. Total Environ., vol. 656, pp. 1358–1364, Mar. 2019, doi:
10.1016/j.scitotenv.2018.11.391.
[6] C. Pérez-D’Arpino and J. A. Shah, “Fast target prediction of human reaching motion for
cooperative human-robot manipulation tasks using time series classification,” in 2015
IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 6175–
6182, doi: 10.1109/ICRA.2015.7140066.
[7] N. Maknickienė, A. V. Rutkauskas, and A. Maknickas, “Investigation of financial
market prediction by recurrent neural network,” Innov. Technol. Sci. Bus. Educ., vol. 2,
no. 11, pp. 3–8, 2011.
[8] L. Martín, L. F. Zarzalejo, J. Polo, A. Navarro, R. Marchante, and M. Cony, “Prediction
of global solar irradiance based on time series analysis: Application to solar thermal
power plants energy production planning,” Sol. Energy, vol. 84, no. 10, pp. 1772–1781,
Oct. 2010, doi: 10.1016/j.solener.2010.07.002.
[9] J. F. Muth, “Optimal Properties of Exponentially Weighted Forecasts,” J. Am. Stat.
Assoc., vol. 55, no. 290, pp. 299–306, 1960, doi: 10.1080/01621459.1960.10482064.
[10] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh, “The great time series
classification bake off: a review and experimental evaluation of recent algorithmic
advances,” Data Min. Knowl. Discov., vol. 31, no. 3, pp. 606–660, 2017, doi:
10.1007/s10618-016-0483-9.
[11] D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time
series.,” in KDD workshop, 1994, vol. 10, no. 16, pp. 359–370.
[12] G. E. P. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis:
forecasting and control. John Wiley & Sons, 2015.
[13] G. He, Y. Li, and W. Zhao, “An uncertainty and density based active semi-supervised
learning scheme for positive unlabeled multivariate time series classification,”
Knowledge-Based Syst., vol. 124, pp. 80–92, 2017, doi: 10.1016/j.knosys.2017.03.004.
[14] M. L. Tuballa and M. L. Abundo, “A review of the development of Smart Grid
technologies,” Renewable and Sustainable Energy Reviews, vol. 59. Elsevier Ltd, pp.
710–725, Jun. 01, 2016, doi: 10.1016/j.rser.2016.01.011.
classification with temporal abstractions,” Proc. 22nd Int. Florida Artif. Intell. Res. Soc.
Conf. FLAIRS-22, pp. 344–349, 2009.
[56] B. Zhao, H. Lu, S. Chen, J. Liu, and D. Wu, “Convolutional neural networks for time
series classification,” J. Syst. Eng. Electron., vol. 28, no. 1, pp. 162–169, 2017, doi:
10.21629/JSEE.2017.01.18.
[57] A. P. Ruiz, M. Flynn, J. Large, M. Middlehurst, and A. Bagnall, “The great multivariate
time series classification bake off: a review and experimental evaluation of recent
algorithmic advances,” Data Min. Knowl. Discov., vol. 35, no. 2, pp. 401–449, Mar.
2021, doi: 10.1007/s10618-020-00727-3.
[58] W. Song, L. Liu, M. Liu, W. Wang, X. Wang, and Y. Song, “Representation Learning
with Deconvolution for Multivariate Time Series Classification and Visualization,”
Commun. Comput. Inf. Sci., vol. 1257 CCIS, pp. 310–326, 2020, doi: 10.1007/978-981-
15-7981-3_22.
[59] K. S. Kiangala and Z. Wang, “An Effective Predictive Maintenance Framework for
Conveyor Motors Using Dual Time-Series Imaging and Convolutional Neural Network
in an Industry 4.0 Environment,” IEEE Access, vol. 8, pp. 121033–121049, 2020, doi:
10.1109/ACCESS.2020.3006788.
[60] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, Long Short-Term
Memory, fully connected Deep Neural Networks,” ICASSP, IEEE Int. Conf. Acoust.
Speech Signal Process. - Proc., vol. 2015-August, pp. 4580–4584, 2015, doi:
10.1109/ICASSP.2015.7178838.
[61] E. Y. Hsu, C. L. Liu, and V. S. Tseng, “Multivariate time series early classification with
interpretability using deep learning and attention mechanism,” in Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), Apr. 2019, vol. 11441 LNAI, pp. 541–553, doi:
10.1007/978-3-030-16142-2_42.
[62] M. Khan, H. Wang, A. Ngueilbaye, and A. Elfatyany, “End-to-end multivariate time
series classification via hybrid deep learning architectures,” Pers. Ubiquitous Comput.,
2020, doi: 10.1007/s00779-020-01447-7.
[63] A. M. Tripathi, “Enhancing Multivariate Time Series Classification Using LSTM and
Evidence Feed Forward HMM,” Proc. Int. Jt. Conf. Neural Networks, 2020, doi:
10.1109/IJCNN48605.2020.9207636.
[64] G. He, Y. Duan, Y. Li, T. Qian, J. He, and X. Jia, “Active learning for multivariate time
series classification with positive unlabeled data,” Proc. - Int. Conf. Tools with Artif.
Intell. ICTAI, vol. 2016-January, pp. 178–185, 2016, doi: 10.1109/ICTAI.2015.38.
[65] M. González, C. Bergmeir, I. Triguero, Y. Rodríguez, and J. M. Benítez, “Self-labeling
techniques for semi-supervised time series classification: an empirical study,” Knowl.
Inf. Syst., vol. 55, no. 2, pp. 493–528, 2018, doi: 10.1007/s10115-017-1090-9.
[66] L. Wei and E. Keogh, “Semi-supervised time series classification,” in Proceedings of
the 12th ACM SIGKDD international conference on Knowledge discovery and data
mining, 2006, pp. 748–753.
[67] “Unix time - Wikipedia.” https://en.wikipedia.org/wiki/Unix_time (accessed Jun. 10,
2021).
[68] L. Gruenwald, H. Chok, and M. Aboukhamis, “Using data mining to estimate missing
sensor data,” in Proceedings - IEEE International Conference on Data Mining, ICDM,
2007, pp. 207–212, doi: 10.1109/ICDMW.2007.103.
[69] “Labelling Time Series Data in Python | by Lucy Rothwell | Towards Data Science.”
https://towardsdatascience.com/labelling-time-series-data-in-python-af62325e8f60
(accessed Jun. 11, 2021).
[70] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1995.
[71] “Undersampling Algorithms for Imbalanced Classification.”
https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-
classification/ (accessed Jun. 12, 2021).
[72] “The 5 Most Useful Techniques to Handle Imbalanced Datasets - KDnuggets.”
https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-
datasets.html (accessed Jun. 12, 2021).
[73] “Random Oversampling and Undersampling for Imbalanced Classification.”
https://machinelearningmastery.com/random-oversampling-and-undersampling-for-
imbalanced-classification/ (accessed Jun. 12, 2021).
[74] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic
Minority Over-sampling Technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002,
doi: 10.1613/jair.953.
[75] “Imbalanced Learning: Foundations, Algorithms, and Applications - Google Books.”
https://books.google.com/books?hl=en&lr=&id=CVHx-
Gp9jzUC&oi=fnd&pg=PT9&dq=Imbalanced+Learning:+Foundations,+Algorithms,+a
nd+Applications+1st+Edition&ots=2iMkJjGobj&sig=ydvxXpVL7gZa66NLKwKctoE
wyJw#v=onepage&q&f=false (accessed Jun. 12, 2021).
[76] “Bank Data: SMOTE. This will be a short post before we… | by Zaki Jefferson |
Analytics Vidhya | Medium.” https://medium.com/analytics-vidhya/bank-data-smote-
b5cb01a5e0a2 (accessed Jun. 12, 2021).
[77] “(34) Handling imbalanced dataset in machine learning | Deep Learning Tutorial 21
(Tensorflow2.0 & Python) - YouTube.”
https://www.youtube.com/watch?v=JnlM4yLFNuo&t=1914s (accessed Jun. 12, 2021).
[78] “Perceptron - Wikipedia.” https://en.wikipedia.org/wiki/Perceptron (accessed May 28,
2021).
[79] “[Memo Sheet] Deep Neural Network. Have you ever dreamed of a place where… | by
Harry Pommier | Zenika.” https://medium.zenika.com/memo-sheet-deep-neural-
network-dedcda759d9c (accessed May 28, 2021).
[80] “Activation Functions for Artificial Neural Networks - mlxtend.”
http://rasbt.github.io/mlxtend/user_guide/general_concepts/activation-functions/
(accessed May 28, 2021).
[81] “Difference Between a Batch and an Epoch in a Neural Network.”
https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
(accessed May 28, 2021).
Appendix
N.B. The Python implementation is inspired by the YouTube channel “Codebasics” and the
website “https://machinelearningmastery.com/”.
Ensemble Learning (CNN, LSTM, CNN-LSTM):
……………………………………………………………………………………………….
import numpy as np
import pandas as pd
from numpy import array
import matplotlib.pyplot as plt
import seaborn as sn
import tensorflow as tf
from sklearn.metrics import classification_report
from sklearn.preprocessing import MinMaxScaler

labelled_data_file_name = 'Hopper_labelled_data.csv'
dataset = pd.read_csv(labelled_data_file_name, skiprows=0)

# Coerce the sensor readings and labels to numeric; unparseable entries become NaN
numeric_columns = [' Delivery Air Dewpoint (F)',
                   'Regen Temp Wheel Inlet (F)',
                   'Hopper 1 Hopper Outlet Temp (F)',
                   'Hopper 1 Drying Monitor 2 Temp (F)',
                   'Hopper 1 Drying Monitor 4 Temp (F)',
                   'Hopper 1 Drying Monitor 6 Temp (Top) (F)',
                   'labels']
for col in numeric_columns:
    dataset[col] = pd.to_numeric(dataset[col], errors='coerce')

# Short column codes for the twelve sensor channels plus the label column
dataset.columns = ['DAD', 'RTAS', 'RTWI', 'RTWO', 'H1DAT', 'H1HOT', 'H1DM1T',
                   'H1DM2T', 'H1DM3T', 'H1DM4T', 'H1DM5T', 'H1DM6T', 'labels']

# Forward-fill the missing sensor values
for col in ['DAD', 'RTWI', 'H1HOT', 'H1DM2T', 'H1DM4T', 'H1DM6T']:
    dataset[col].fillna(method='pad', inplace=True)

columns_scaling = ['DAD', 'RTAS', 'RTWI', 'RTWO', 'H1DAT', 'H1HOT', 'H1DM1T',
                   'H1DM2T', 'H1DM3T', 'H1DM4T', 'H1DM5T', 'H1DM6T']
scaler = MinMaxScaler()
dataset[columns_scaling] = scaler.fit_transform(dataset[columns_scaling])
rows, columns = dataset.shape
count_test = int((rows/60)*0.2)*60
count_train = rows - count_test
dataset_train = dataset[:count_train]
dataset_test = dataset[count_train:]
dataset_train = np.array(dataset_train)
dataset_test = np.array(dataset_test)
# split a multivariate sequence into samples (window length n_steps, stride of 60 rows)
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this window
        end_ix = 60*i + n_steps
        # stop once the window would run past the end of the dataset
        if end_ix > len(sequences):
            break
        # inputs: all sensor columns; label: last column of the window's final row
        seq_x, seq_y = sequences[60*i:end_ix, :-1], sequences[end_ix-1, -1]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
# choose a number of time steps
n_steps = 60
# convert into input/output
trainX_pre, trainy_pre = split_sequences(dataset_train, n_steps)
trainy_pre = trainy_pre.reshape(trainy_pre.shape[0],1)
trainX_pre = np.asarray(trainX_pre).astype(np.float32)
testX, testy = split_sequences(dataset_test, n_steps)
testy = testy.reshape(testy.shape[0],1)
testX = np.asarray(testX).astype(np.float32)
print(trainX_pre.shape, trainy_pre.shape, testX.shape, testy.shape)
dim1 =trainX_pre.shape[0]
dim2 =trainX_pre.shape[1]
dim3 =trainX_pre.shape[2]
trainX_pre = trainX_pre.reshape(dim1, dim2*dim3)
trainX_pre = pd.DataFrame(trainX_pre)
trainy_pre = pd.DataFrame(trainy_pre)
trainy_pre.columns = ['labels']
dataframe = pd.concat([trainX_pre, trainy_pre], axis =1)
count_train0, count_train1 = dataframe.labels.value_counts()
dataframe_train0 = dataframe[dataframe['labels']==0]
dataframe_train1 = dataframe[dataframe['labels']==1]
dataframe_train0.shape, dataframe_train1.shape
def train_set(df_majority, df_minority, start, end):
    # combine one class-0 segment with all class-1 examples and shuffle
    df_train = pd.concat([df_majority[start:end], df_minority], axis=0)
    df_train = df_train.sample(frac=1)
    trainX = df_train.drop(['labels'], axis=1)
    trainy = df_train['labels']
    trainX = np.array(trainX)
    trainy = np.array(trainy)
    new_dim = trainX.shape[0]
    trainX = trainX.reshape(new_dim, dim2, dim3)
    trainy = trainy.reshape(trainy.shape[0], 1)
    return trainX, trainy
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
print(model.evaluate(testX, testy, batch_size=64))
yp = model.predict(testX)
y_pred = []
for element in yp:
    if element > 0.5:
        y_pred.append(1)
    else:
        y_pred.append(0)
print(classification_report(testy, y_pred))
cm = tf.math.confusion_matrix(labels = testy, predictions =y_pred)
plt.figure(figsize =(10,6))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
return y_pred
        if n_ones > 2:
            y_pred_final[i] = 1
        else:
            y_pred_final[i] = 0
    return y_pred_final
y_pred_final1_CNN = final_result1(y_pred11_CNN, y_pred12_CNN, y_pred13_CNN,
y_pred14_CNN, y_pred15_CNN)
print(classification_report(testy, y_pred_final1_CNN))
cm1_CNN = tf.math.confusion_matrix(labels = testy, predictions =y_pred_final1_CNN)
plt.figure(figsize =(10,6))
sn.heatmap(cm1_CNN, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
LSTM model cell:
n_timesteps, n_features, n_outputs = trainX21.shape[1], trainX21.shape[2], trainy21.shape[1]
model = Sequential()
model.add(LSTM(100, input_shape=(n_timesteps,n_features)))
model.add(Dropout(0.5))
model.add(Dense(200, activation='relu'))
model.add(Dense(n_outputs, activation='sigmoid'))
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(trainX21, trainy21, validation_split = 0.1, epochs=10, batch_size=64)
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
print(model.evaluate(testX, testy, batch_size=64))
yp = model.predict(testX)
y_pred21 = []
for element in yp:
    if element > 0.5:
        y_pred21.append(1)
    else:
        y_pred21.append(0)
print(classification_report(testy, y_pred21))
cm = tf.math.confusion_matrix(labels = testy, predictions =y_pred21)
plt.figure(figsize =(10,6))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
CNN-LSTM cell:
n_timesteps, n_features, n_outputs = trainX11.shape[1], trainX11.shape[2], trainy11.shape[1]
n_steps, n_length = 4, 15
trainX11 = trainX11.reshape((trainX11.shape[0], n_steps, n_length, n_features))
testX = testX.reshape((testX.shape[0], n_steps, n_length, n_features))
model = Sequential()
model.add(TimeDistributed(Conv1D(filters=16, kernel_size=5, activation='relu'),
                          input_shape=(None, n_length, n_features)))
model.add(TimeDistributed(Dropout(0.5)))
model.add(TimeDistributed(Conv1D(filters=16, kernel_size=5, activation='relu')))
model.add(TimeDistributed(Dropout(0.5)))
model.add(TimeDistributed(MaxPooling1D(pool_size=2)))
model.add(TimeDistributed(Flatten()))
model.add(LSTM(100))
model.add(Dropout(0.5))
model.add(Dense(200, activation='relu'))
model.add(Dense(n_outputs, activation='sigmoid'))
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(trainX11, trainy11, validation_split = 0.1, epochs=10, batch_size=64)
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
print(model.evaluate(testX, testy, batch_size=64))
yp = model.predict(testX)
y_pred11 = []
for element in yp:
    if element > 0.5:
        y_pred11.append(1)
    else:
        y_pred11.append(0)
print(classification_report(testy, y_pred11))
cm = tf.math.confusion_matrix(labels = testy, predictions =y_pred11)
plt.figure(figsize =(10,6))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
SMOTE (CNN, LSTM, CNN-LSTM, ML algorithms) cells:
dim1 =trainX_pre.shape[0]
dim2 =trainX_pre.shape[1]
dim3 =trainX_pre.shape[2]
dim1, dim2, dim3
trainX_pre = trainX_pre.reshape(dim1, dim2*dim3)
trainX_pre.shape
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy = 'minority')
trainX_sm, trainy_sm = smote.fit_resample(trainX_pre, trainy_pre)
trainX_sm = pd.DataFrame(trainX_sm)
trainy_sm = pd.DataFrame(trainy_sm)
trainy_sm.columns = ['labels']
dataframe = pd.concat([trainX_sm, trainy_sm], axis =1)
dataframe = dataframe.sample(frac=1)
dataframe.labels.value_counts()
trainX = dataframe.drop(['labels'], axis =1)
trainy = dataframe['labels']
trainX = np.array(trainX)
trainy = np.array(trainy)
new_dim = trainX.shape[0]
trainX = trainX.reshape(new_dim, dim2, dim3)
trainy = trainy.reshape(trainy.shape[0],1)
trainX.shape, trainy.shape
print(trainX.shape, trainy.shape, testX.shape, testy.shape)
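SMOTE operates on 2-D input, so the cell above flattens each (timesteps × features) window before resampling and restores the 3-D shape afterwards for the CNN/LSTM models. A minimal sketch of that round trip, with illustrative shapes rather than the actual dataset dimensions, is:

```python
import numpy as np

# 2 windows, 60 timesteps, 5 sensor channels (illustrative shapes only)
X = np.arange(2 * 60 * 5).reshape(2, 60, 5)

# Flatten each window to a single row so SMOTE can resample it
flat = X.reshape(2, 60 * 5)

# After resampling, reshape back to 3-D for the sequence models
restored = flat.reshape(2, 60, 5)
```

Because `reshape` only changes the view of the data, the flatten/restore round trip preserves every value exactly; only the synthetic rows SMOTE appends are new.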
ML algorithms application cells:
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sn
def define_models(models=dict()):
    models['knn'] = KNeighborsClassifier(n_neighbors=7)
    models['cart'] = DecisionTreeClassifier()
    models['svm'] = SVC(kernel='poly')
    models['bayes'] = GaussianNB()
    models['bag'] = BaggingClassifier(n_estimators=50)
    models['rf'] = RandomForestClassifier(n_estimators=50)
    models['et'] = ExtraTreesClassifier(n_estimators=100)
    models['gbm'] = GradientBoostingClassifier(n_estimators=100)
    print('Defined %d models' % len(models))
    return models
def evaluate_model(trainX, trainy, testX, testy, model):
    model.fit(trainX, trainy)
    yhat = model.predict(testX)
    cm = tf.math.confusion_matrix(labels=testy, predictions=yhat)
    plt.figure(figsize=(10, 6))
    sn.heatmap(cm, annot=True, fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('Truth')
    accuracy = accuracy_score(testy, yhat)
    print(classification_report(testy, yhat))
    return accuracy * 100.0
def evaluate_models(trainX, trainy, testX, testy, models):
    results = dict()
    for name, model in models.items():
        # evaluate the model
        results[name] = evaluate_model(trainX, trainy, testX, testy, model)
        # show process
        print('>%s: %.3f' % (name, results[name]))
    return results
def summarize_results(results, maximize=True):
    # create a list of (name, score) tuples
    mean_scores = [(k, v) for k, v in results.items()]
    # sort tuples by score
    mean_scores = sorted(mean_scores, key=lambda x: x[1])
    # reverse for descending order (e.g. for accuracy)
    if maximize:
        mean_scores = list(reversed(mean_scores))
    print()
    for name, score in mean_scores:
        print('Name=%s, Score=%.3f' % (name, score))
models = define_models()
results = evaluate_models(trainX, trainy, testX, testy, models)
summarize_results(results)