Testing and Analysis of Predictive Capabilities of Machine Learning Algorithms

Ganesh Khekare, Lokesh Kumar Bramhane, Chetan Dhule, Rahul Agrawal, and Anil V. Turukmane

Abstract The use of machine learning algorithms has grown enormously over the last decade, opening the door to opportunities in many fields of research and business. However, identifying the appropriate algorithm for a particular task has always been a puzzle, and it needs to be solved before any machine learning system is developed. Take the example of a weather forecasting system, which is used to predict the future weather of a particular country or continent: finding the right algorithm or model that can produce accurate predictions for such a purpose is a daunting task. There are several other systems, such as recommendation engines, sales prediction for a mega-store, stock prediction systems, or estimating the chances that a driver will meet with an accident based on past records and the roads taken. These problem statements need to be solved with the most suitable algorithm, and identifying that algorithm is a necessary task. The objective of the proposed system is to develop an interface that displays the result matrix of different machine learning algorithms after they have been exposed to different datasets with different features. The system compares a set of machine learning algorithms and determines the appropriate algorithm for the selected predictive task using the required datasets. Stock market, earthquake, and sales forecasting data are used for the analysis. For the experimental performance analysis, several technologies and tools are used, including Python, Django, Jupyter Notebook, machine learning, and data science methodologies. A comparative performance analysis of five well-known time series forecasting machine learning algorithms, viz. linear regression, K-nearest neighbor, Auto ARIMA, Prophet, and support vector machine, is carried out.

G. Khekare (B) · A. V. Turukmane
Parul University, Vadodara, India
e-mail: [email protected]

A. V. Turukmane
e-mail: [email protected]

L. K. Bramhane
National Institute of Technology, Goa, India
e-mail: [email protected]

C. Dhule · R. Agrawal
G H Raisoni College of Engineering, Nagpur, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. E. H. Houssein et al. (eds.), Integrating Meta-Heuristics and Machine Learning for Real-World Optimization Problems, Studies in Computational Intelligence 1038. https://doi.org/10.1007/978-3-030-99079-4_16


Keywords Best known machine learning algorithms · Survey · Experimentation · Performance comparison · Stock market prediction · Earthquake and sales forecasting · Predictive analysis

1 Introduction

The system concentrates mainly on machine learning algorithms used in predictive modeling. Machine learning algorithms are self-adjusting methods that deliver better results after being exposed to data. The learning part of machine learning signifies that the models being built change according to the data they encounter during fitting.
The idea behind building this system was to determine which of the chosen time series forecasting algorithms is the most suitable for these tasks. The uniqueness of this work is established in the literature review section of this study. The five algorithms chosen are linear regression, K-nearest neighbor, Auto ARIMA, support vector machine, and Facebook's Prophet, which have never been compared together on a common platform. In addition, several datasets were extracted for building and testing these models, along with the evaluation metrics.
Since the extracted datasets are time-series forecasting types, that’s why algo-
rithms that are most suitable for these kinds of works are chosen in this system. The
term time series forecasting means that the system is going to make a prediction based
on time-series data. Time series data are those where records are indexed based on
time, that can be anything like a proper date, a timestamp, quarter, term, or year. In
this type of forecasting, the date column is used as a predictor/independent variable
for predicting the target value.
A machine learning algorithm builds a model from a dataset by being trained and tested. The dataset is split into two parts, a train set and a test set; in general the records of the two do not overlap, and machine learning offers different mechanisms for this task. After the model is fitted on the training portion, it must be tested, and for that the test set comes into play. The results that are generated are then matched against the desired targets with the help of evaluation metrics. The two evaluation metrics considered for comparison, the mean absolute percentage error and the root mean squared error, are discussed in detail in the Methodology section.
The final focus of this development is the interface that shows the most appropriate machine learning algorithm when the models are built over different datasets for different purposes. As a result, it determines the most appropriate algorithm for a specific dataset and the forecasting functions required in businesses and markets. This part of the system is implemented using Django, a high-level Python web framework that encourages rapid development and clean, pragmatic design. Along with that, Cascading Style Sheets (CSS) are used with Hypertext Markup Language (HTML) to make the visuals compatible and attractive. Bootstrap, a well-known HTML, CSS, and JavaScript framework, is used to add features such as the navbar and the hamburger menu to the interface.
The reason for taking up this analytical work is that, during a study of data science models, the team had difficulty choosing appropriate algorithms for a particular problem statement. They had trouble determining which algorithm would be preferable for a given dataset, and this was not a one-time problem; it had occurred several times in the past. So the team decided to build something that could resolve this concern, confident that the same problem must have troubled many developers and machine learning practitioners before. The only way to answer the question was a comparative analysis of a set of algorithms designed for the same purpose, and from there the idea of working on this problem statement arose. In this system, five algorithms were chosen, along with three distinct time series datasets and two evaluation metrics. Despite enormous advancements, algorithms are still unable to predict things perfectly, and that is the main motivation for this research work: a system is needed that identifies the best algorithm for the given data.
The upcoming sections of this study concentrate on the main purpose of the work and the reasons for undertaking it, covering every relevant point regarding the implementation, the scope of the work, its summary, and details about the hardware and software used. The study addresses the methodology behind the development of the application and explores the tools and other requirements collected for it. Later, it informs the reader about the design and implementation of the final system and proceeds to the conclusions the team reached after the development, along with the corresponding results.

2 Literature Review

This section of the study presents the opinions and conclusions of several researchers who have contributed work to the field of machine learning algorithms, and it compares the reported outcomes of the algorithms mentioned in the previous section. A literature review means reading previous works related to the field of study and trying to understand which fundamentals were taken into consideration to reach useful conclusions. All the conclusions in the upcoming paragraphs are based on the individual work of the respective authors, and this study tries to present them in a way that helps the team in further analysis.
Vansh Jatana states in his paper Machine Learning Algorithms [1] that machine learning is a branch of AI that allows a system to train and learn from past data and activities. The paper also explores a set of regression, classification, and clustering algorithms through several parameters, including memory size, overfitting tendency, learning time, and prediction time. Compared with Random Forest, Boosting, SVM, and neural networks, linear regression requires less learning time, and, like logistic regression and Naive Bayes [2], its overfitting tendency is low. In that study, linear regression is the only pure regression model, as the others also serve as classification or clustering models.
Another paper [2] studies the significant machine learning algorithms applied to sample datasets from the medical domain. It explores algorithms such as Random Forest, Decision Tree, K-nearest neighbour (KNN), Naive Bayes, support vector machine, K-means, Apriori, reinforcement learning, and PCA. Among them, KNN is the algorithm implemented in the present system. KNN is a supervised, non-parametric learning algorithm used for both regression and classification. When the supervised learning algorithms were compared on performance, KNN showed the best result of all with an accuracy of 80.52% and the lowest training time of 0.0009 s; however, its prediction time was the largest at 0.003 s.
Ariruna Dasgupta and Asoke Nath [3] discuss the broad classification of prominent machine learning algorithms in their paper and specify new applications for them. Supervised learning requires prior knowledge and always produces the same output for a specific input. Reinforcement learning requires prior knowledge too, but the output changes if the environment does not remain the same for a specific result. Unsupervised learning, in contrast, does not require prior knowledge.
Regarding Auto ARIMA, Prapanna Mondal, Labani Shit, and Saptarsi Goswami [4] carried out a study on 56 stocks from seven sectors, considering stocks registered on the National Stock Exchange (NSE) and using 23 months of data for their observational research. They evaluated the accuracy of the ARIMA model in predicting stock prices. For all the sectors, the ARIMA model's accuracy in forecasting stock prices is higher than 85%, which indicates that ARIMA provides reasonable accuracy. Looking at specific sectors, forecasting stocks in the FMCG sector with the ARIMA model gives results with the most reliable accuracy.
However, the prediction accuracy for the banking and automobile sectors using the ARIMA model is comparatively lower than for the other sectors, so the authors proposed that a more powerful model is needed for forecasting the stocks of companies in those sectors. For the IT sector, the standard deviation is neither too low nor too high, and the accuracy in prediction is above 90%. This leads to the conclusion that ARIMA is quite reliable for large numeric data prediction and stock analysis.
A work by Kemal Korjenić, Kerim Hodžić, and Dženana Ðonko [5] evaluates the Prophet model's performance in real-world use cases. The model is able to generate conventional monthly as well as quarterly forecasts, and it shows considerable potential for classifying a portfolio into classes according to the expected level of forecast reliability: about 50% of the product portfolio (with a large amount of data) can be projected with MAPE < 30% monthly, whereas around 70% can be predicted with MAPE < 30% quarterly (of which 40% with MAPE < 15%).
It is worth noting that the roughly 40% of the product portfolio that can be forecast with MAPE < 15% quarterly consists mostly of the best-selling items of the retail company in question, accounting for more than 80% of the annual (financial) share of the whole portfolio. Based on these facts, the achieved results are quite satisfactory for real-world sales forecasting, and Prophet is adequate for short- and medium-term prediction.
Sibarama Panigrahi and H. S. Behera [6] used FTSF-DBN, FTSF-LSTM, and FTSF-SVM models as comparative algorithms for fuzzy time series forecasting (FTSF) in their paper. These machine learning algorithms are used to model fuzzy logical relationships (FLRs). The paper concluded that FTSF-DBN outperformed the DBN (deep belief network) method, but it also reported that the statistical difference between FTSF-LSTM and LSTM is insignificant. FTSF-SVM, however, had better results than SVM and statistically outperformed it. Finally, they concluded that, according to their results, FTSF algorithms provide statistically better or comparable outcomes relative to their crisp equivalents.
Forecasting Stocks [7] used an LSTM model to predict stock prices and as a tool to generate future insight into the stock of interest. The paper noted that good results were obtained when four cases, namely High, Low, Open, and Close, were considered, and it showed that the algorithm was able to deal with the stock market and predict the stock price with great accuracy.
Comparative Analysis of Time-Series Forecasting Algorithms for Stock Price Prediction [8] forecasts the average stock price for five datasets using historical stock price data ranging from April 2009 to February 2019. An Auto-Regressive Integrated Moving Average (ARIMA) model is used to produce the baseline, while long short-term memory (LSTM) networks are used to build the forecasting model for predicting the stock price. The authors found that on large data samples the ARIMA model outperforms the other models. However, on one year's worth of data samples, an attention mechanism improves the accuracy of the LSTM model, which then exceeds the ARIMA model. They also concluded that the LSTM model is much more powerful than most machine learning algorithms.
Another paper [9] showed the full, detailed process of applying the ARIMA algorithm to stock price prediction. The findings were very satisfactory for short time horizons: although the model cannot be used for long-term prediction, for the short term it can play a crucial role in predicting and forecasting stock prices for investors. The paper also built a test model on Nokia stock, in which the algorithm performed quite well and showed only minor deviation from the real values.
Regarding K-nearest neighbour (KNN), it has been stated in a paper [10] that KNN, as a data mining algorithm, has a broad range of uses in regression and classification scenarios and is mostly used for data mining and data categorization. In agriculture, it can be applied to simulating daily precipitation and weather forecasts. KNN can be used efficiently [11] to determine required patterns and correlations between data. Other techniques, such as hierarchical clustering, k-means, regression models, ARIMA [12], and decision tree analysis, can also be applied in this broad field of exploration. KNN can also be applied in the medical field to predict the reason for a patient's admission to hospital. This massive range of applications makes KNN a worthwhile algorithm to research, which is why the team decided to include it in the list of comparative algorithms [13].
In the preceding paragraphs, the reader can see that several researchers have presented their work with useful results. Some of the papers focus on individual machine learning algorithms used for time series forecasting, while others present a relative comparison between a number of algorithms and try to conclude which one is the best of all. These algorithms have different working fundamentals and fit the data differently, and all of them require an ample amount of time to study. The methodology section of this chapter discusses all the chosen machine learning algorithms briefly and tries to make the reader comfortable with the different techniques being used.
Overall, the analysis of the papers published in recent years gives a broad perspective on different machine learning algorithms, specifically the time series and prediction algorithms that feature in the implementation of this system. From the above study it can also be concluded that each algorithm belongs to a different category and has significant applications, and some of the comparative studies define the best machine learning techniques based on several parameters. Nevertheless, in this whole process of surveying these works, the team never came across any study in which the five chosen algorithms were compared on one platform with a common dataset, so the team saw an opportunity to compare these five algorithms, which are different in nature but share enough similarities that they can all be used for time series forecasting. The upcoming section discusses these algorithms and the two evaluation metrics in brief, and later the reader will find their implementation.

3 Methodology

The idea was to create an interface that could display the result matrix and multiple analyses with words, numbers, statistics, and pictorial representations [14]. The visual interface should not distract the audience from the topic and should include only limited, necessary items: which algorithms are used, which datasets are used, their data analysis, and the corresponding comparative results. The construction of the interface was the ultimate concern of the whole research and system-building campaign. To build the interface, the team needed statistics to display on screen; in short, they needed the required results [15].

To produce those results, the models first had to be developed [16]. As the title of the work suggests, the system focuses on machine learning algorithms, but as clarified later, the actual focus is on the time series forecasting type [17]. Time series prediction means predicting something using time as the index feature. Time series datasets can relate to sales, stocks, weather, natural disasters, and so on, containing historical records with some minor errors [18].
The aim, however, was not simply to predict results; it was to find out which machine learning model fits best on the datasets used. After going through the previous works covered in the last section, the team found that none of the chosen algorithms had been compared on a common platform. It was also valuable to compare a few dissimilar algorithms to find the best fit and to determine the nature of the individual datasets. The selected algorithms are linear regression, K-nearest neighbour (KNN), Auto ARIMA, Facebook's Prophet, and support vector machine. These algorithms and their workings are discussed later in this section, but before that it is necessary to explain why they were chosen.
In fact, the second reason is the type of datasets extracted along the development path of this study. The first of the three datasets is named 'Tata Global Beverages Limited' and is published by the National Stock Exchange (NSE). The dataset has eight usable features for data analysis and model design, and it contains two hundred and twenty records that are used for model training and testing and for producing results. It includes features such as Date, Open, High, Low, Last, Close, Total Trade Quantity, and Turnover. When all variables except Total Trade Quantity are compared against Date on one plot, they are found to follow the same trend, so the team decided that any of these variables could be selected as the target variable.
The second dataset is the Earthquake dataset, which describes significant earthquakes that occurred between 1965 and 2016, as published by the US Geological Survey. The National Earthquake Information Center determines the location and extent of all significant earthquakes occurring globally and broadcasts this information immediately to national and world organizations, scientific institutions, and interested people. This dataset is published on Kaggle and includes the date, time, intensity, magnitude, place, and origin of all earthquakes of magnitude 5.5 or higher reported since 1965.
The team used two variables together for the time series prediction, and the following analysis shows the different parameters and the relationship between the two variables used in the system. Where entries in the dataset were empty, the team used Python's reverse_geocoder library to fill in the location information, as illustrated below.
However, there were only two options for the target variable: Depth or Magnitude. The relationship between Date/Year and Depth appeared neither categorical nor linear, whereas the variation of Magnitude against Date turns out to be naturally categorical, so Magnitude can be selected as the target value for modelling. The third dataset used in the system is a sales dataset called the 'Superstore Dataset', released by Tableau (donated by Michael Martin). It records the sales of stationery and other products in the United States and Canada from 2014 to 2017, with 9,995 sales records and 20 different features. Although there are 20 variables, there were only two candidates for the target variable, Sales or Quantity, because Profit always appeared inconsistent. Further analysis showed that the relationship between Quantity and Date is far more categorical than that between Sales and Date, so Quantity was selected as the target variable for this part of the work.
After that, it was very important to clean and prepare the datasets before building the individual models. The raw data clearly shows a few missing records and some unusual values in a few rows, and data analysis practice dictates removing or altering those records to suit the models. Since the Earthquake dataset contains records of disasters since 1965, some records were blank or vague, and the date column contained dates in several different formats. To avoid confusion, the team stored the dates in a standard format, i.e., dd-mm-yyyy, and performed the same work on all the other datasets that were extracted. It was also preferable to eliminate empty records, as doing so would not significantly affect the outcome. In the end, it was clear that only two variables would be used in each case.
Preprocessing involves setting the Date column of each dataset as the index and sorting the records in chronological order. All of these steps were performed in Python 3 with the help of libraries such as Pandas, NumPy, scikit-learn, and datetime. It was very important to map each date to its ordinal value, since it plays an important role as the independent variable (datetime.toordinal() returns the day count for a given date in standard format); this preprocessing is sketched below.
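The snippet is illustrative rather than the authors' own code; the CSV file name and column names are assumed to follow the NSE stock dataset described earlier.

import pandas as pd

# Load the stock data, parse and sort the dates, and index the frame by Date
df = pd.read_csv("NSE-TATAGLOBAL.csv")                   # hypothetical file name
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)   # dates stored as dd-mm-yyyy
df = df.sort_values("Date").set_index("Date")

# Map each date to its ordinal day count: the independent variable for the models
X = df.index.map(pd.Timestamp.toordinal).to_numpy().reshape(-1, 1)
y = df["Close"].to_numpy()                               # target variable for the stock dataset

# Chronological split (no shuffling) into train and test portions
split = int(len(df) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]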
After that, everything came down to implementing the pre-selected models for the system. The data had to be analysed first so that useful comparative results could be produced. The chosen machine learning algorithms are as follows:
Linear Regression: Linear regression is a simple and well-known machine learning algorithm. It is a statistical procedure applied for predictive analysis, and it delivers forecasts for continuous or numeric variables such as trades, wages, age, and goods' worth.
It models a linear correlation between a target and one or more explanatory variables; in other words, it determines how the value of the dependent (target) variable changes with the value of the predictor variable.
Mathematically, it can be represented as shown in Eq. 1,

y = θ0 + θ1 x1 + θ2 x2 + . . . + θn xn (1)

Here, y is the target variable and x1, x2, …, xn are the predictor variables that represent every other feature in the dataset; θ0, θ1, θ2, …, θn are the parameters that are calculated by fitting the model.
In the case of using two variables i.e., 1 independent and 1 dependent variable, it
can be represented as shown in Eq. 2,

y = θ0 + θ1 x (2)
where θ0 is the intercept on the y-axis and θ1 is the slope coefficient obtained once the model is trained; an illustrative fit is shown below.
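As a hedged illustration (not the authors' code), a model of the form in Eq. 2 can be fitted with scikit-learn on the ordinal-date feature prepared in the preprocessing sketch above:

from sklearn.linear_model import LinearRegression

# Fit y = theta_0 + theta_1 * x on the ordinal-date feature
lr = LinearRegression()
lr.fit(X_train, y_train)
print("theta_0 (intercept):", lr.intercept_)
print("theta_1 (slope):", lr.coef_[0])

# Forecast the held-out period
y_pred_lr = lr.predict(X_test)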
K-Nearest Neighbour: K-nearest neighbour calculates the similarity between new data and recorded cases and places the new record in the section where similar data already exist. It stores all the recorded values and assigns the new record a score based on resemblance; when a new record arrives, it can readily be organized into a well-fitting category by applying the algorithm.
It computes the distance between the input and the existing data points and provides the prediction accordingly, as shown in Eq. 3.

d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \dots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}    (3)

Here n is the number of features taken into consideration, and q and p are the new and existing data points respectively. The point situated at the smallest distance from the new point is taken to belong to the same class.
Auto ARIMA: ARIMA stands for Auto-Regressive Integrated Moving Average. It is a simple and efficient machine learning algorithm used for time series forecasting, and it combines two components, auto-regression and a moving average. The auto-regressive part applies a regression over former records to predict the next values, i.e., it regresses on the earlier time steps up to t-1 to foretell the value at t. The moving average (MA) part measures the mean over a distinct time period by dividing the sum of the values by the total number of periods covered.
It takes past values into account for future prediction. There are three essential parameters in ARIMA:
p => the number of historical observations used for predicting the upcoming data.
q => the number of historical prediction errors used for forecasting the upcoming data.
d => the order of differencing.
Tuning these parameters for ARIMA normally takes a lot of time, but in Auto ARIMA the model itself finds the optimal p, d, and q values for the dataset to provide better predictions; a brief sketch is given below.
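A minimal sketch of this automatic order selection, assuming the pmdarima package referred to later in the text; y_train and y_test come from the earlier preprocessing sketch, and the remaining arguments are assumptions rather than the authors' settings.

from pmdarima import auto_arima
from pmdarima.arima import ndiffs

# Unit-root test: estimate how many differences make the series stationary
d = ndiffs(y_train)

# Let auto_arima search for suitable (p, d, q) values
model = auto_arima(y_train, d=d, seasonal=False, suppress_warnings=True)

# Forecast as many periods as there are test records
forecast = model.predict(n_periods=len(y_test))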
Prophet: Prophet is an open-source library from Facebook made for forecasting time series data, learning from the data and producing the likely forecast. It relies on non-linear trends fitted over the data. The trend is the inclination of the data to increase or decrease over a period, with the periodic variations removed; seasonal variations occur over a short duration and are not notable enough to be described as a trend. The broad approach of the model is like a generalized additive model, whose terms are defined as shown in Eq. 4,

fn(t) = g(t) + s(t) + h(t) + e(t) (4)



where,
g(t) => trend.
s(t) => seasonality.
h(t) => the effect of holidays on the forecast.
e(t) => error term.
fn(t) => the forecast.
The behaviour of these terms is mathematical in nature, and if they are not studied properly the model may make wrong predictions, which can be very problematic for the customer or the business in practice.
Prophet lets us use two different trend models: a logistic growth model and a piecewise linear model (the default). Selecting a model is important because it depends on many things, such as business size, rate of growth, and business model. If the data to be estimated is saturating and non-linear (it grows non-linearly and, after reaching saturation, shows little or no growth or decline apart from a few seasonal changes), then the logistic growth model is the better option. On the other hand, if the data shows linear properties and has had growth or decline trends in the past, then a piecewise linear model is the more practical choice; a short sketch follows.
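The sketch below illustrates the two options, assuming the fbprophet package named later in the text; the column names ds and y are required by Prophet's API, while the capacity used for the logistic variant is an arbitrary assumption.

import pandas as pd
from fbprophet import Prophet   # newer releases ship the same class in the "prophet" package

# Prophet expects a frame with columns "ds" (date) and "y" (target)
train = pd.DataFrame({"ds": df.index[:split], "y": y_train})

m = Prophet()                    # default: piecewise linear trend
# For saturating, non-linear data a logistic trend with a capacity column can be used instead:
#   train["cap"] = float(y_train.max()) * 1.1
#   m = Prophet(growth="logistic")
m.fit(train)

future = m.make_future_dataframe(periods=len(y_test))
forecast = m.predict(future)[["ds", "yhat"]]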
Support Vector Machine: The SVM [15] is a machine learning algorithm employed for both regression and classification, depending on the problem. A hyperplane is created based on the number of features: if there are two features, a straight line can serve as the hyperplane; if there are three features, the hyperplane becomes a plane in three-dimensional space. The boundary of this hyperplane is defined using the positions, or vectors, known as the support vectors of the plane, and new points are then placed relative to this hyperplane.
In a linear SVM the features are linearly separable, so a simple straight line can be used to implement the SVM. The formula for the hyperplane in this case is shown in Eq. 5:

y = mx + c (5)

If the features are of a non-linear type, more dimensions need to be added, and in that case a plane must be used. The formula for the hyperplane in this case is shown in Eq. 6; an illustrative regression sketch follows the equation:

z = x2 + y2 (6)
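As with the other models, the following is a hedged sketch of support vector regression on the ordinal-date feature; the kernel and the C value are assumptions, and scaling is added because SVR is sensitive to feature scale.

from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# RBF-kernel support vector regression with feature scaling
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))
svr.fit(X_train, y_train)
y_pred_svr = svr.predict(X_test)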

Having discussed the machine learning algorithms that were used, it is essential to know how they are evaluated. The evaluation tells how accurate, and how good, each model is.
In this system, the two evaluation metrics used to generate the results are the mean absolute percentage error (MAPE) and the root mean squared error (RMSE), and both depend on the predicted and the actual values. The root mean squared error, a.k.a. RMSE, is obtained by taking the square root of the mean of the individual squared errors. The formula is shown in Eq. 7:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}    (7)

Here ŷ1, ŷ2, ŷ3, …, ŷn are the actual values, y1, y2, y3, …, yn are the respective predicted values, and n is the number of observations.
In MAPE, or mean absolute percentage error, the value is calculated by taking the absolute difference between the actual value and the predicted value, dividing it by the actual value, and then averaging these individual terms, as shown in Eq. 8:

M = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|    (8)

Here A1, A2, A3, …, An represent the actual values, F1, F2, F3, …, Fn represent the forecast values, and n is the number of observations taken under consideration; an illustrative implementation of both metrics follows.
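A direct, illustrative NumPy implementation of Eq. 7 and Eq. 8; the factor of 100 in MAPE converts the ratio to a percentage, matching the way the result tables report it.

import numpy as np

def rmse(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((predicted - actual) ** 2))           # Eq. 7

def mape(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs((actual - predicted) / actual)) * 100  # Eq. 8, as a percentage

# Example: score the linear regression forecast from the earlier sketch
print(rmse(y_test, y_pred_lr), mape(y_test, y_pred_lr))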

4 System Design

The design of the whole system depends on the flow of modules. The work is segregated into six modules, and the team developed the whole system by going through the six modules discussed in this section. Figure 1 describes the modules and processes involved in the long process of implementing the required interface.
Data Requirements and Collection.
In this phase of the implementation, the main objective is to understand what kind of datasets are required. Understanding the data requirements plays a vital role in the upcoming modules of this long process.
After understanding the data requirements, the next step is to collect the required datasets; this gives a first hint of how the model building process will take place. The three datasets mentioned in the previous section are collected here.
Data Preparation.
This phase of the implementation is the most crucial: it lets the implementer identify the flaws in the collected data. To work with the data, it needs to be prepared in a way that addresses missing or erroneous values, eliminates duplicates, and ensures that it is accurately formatted for modeling.

Fig. 1 Flow of modules
Modelling.
In this module, the team implemented the algorithms as required in Python with the help of several Python libraries. This is the phase that decides how the information can be modelled to find the required solution. All five algorithms mentioned in the previous section, whether predictive or descriptive, were implemented here. The team used the training set for predictive modeling, as it acts as a guideline for deciding whether the model requires calibration. Besides that, this module helped in processing the different datasets through all the algorithms on the table.
Model Evaluation.
Model’s assessment will probably assess the calculations that are actualized in the
past module. It is intended to decide the right logical methodology or strategy to take
care of the issue. Assessment permits the nature of the model to be gotten to and
offers a chance to check whether it meets the underlying target of the structure of the
necessary framework. Alongside that, it produces the necessary outcomes. With the
Testing and Analysis of Predictive … 431

help of RMSE, and MAPE, it can be determined which model is most suitable for
a particular time series dataset. The closer the value of RMSE and MAPE towards
zero, the better the model for that dataset.
Interface Building.
In this module, the work turned to interface development. The team established a connection between the interface and the models implemented in the previous phases and, when required, could revert to the fourth phase of the implementation. Django was used as the web framework for this phase. The interface, named Nebula, was built to display all the data analysis done prior to this phase and the metric table that was generated. It also includes a conclusion tab where the team summarized the work by explaining the final results. Nebula consists of three main tabs: a landing page describing what Nebula is, a Dataset tab that shows a pictorial row of the different datasets, each linked to its own analysis page, and an Algorithm tab that describes everything about the algorithms highlighted in this system.
Deployment.
Once the models were evaluated and the interface was developed, the system was deployed and put to the ultimate test. It showed the required comparative results and satisfied the objective the team had set before starting hands-on work on the system.

5 Experimental Results

The idea was to put these five machine learning algorithms, linear regression, Auto ARIMA, K-nearest neighbor, support vector machine, and Facebook's Prophet, through their paces by fitting three different datasets (stock, earthquake, and sales) and finding out which one is the most suitable for time series forecasting based on the figures generated from the two evaluation metrics, RMSE and MAPE. The first step in building this system was to understand the problem statement and work on the chosen objective. After understanding the idea behind the study, the team moved to the next phase, which was none other than data extraction and preparation.
The Methodology and Data Collection portion of this documentation describes all the datasets in detail, including the number of columns and rows, the type of dataset, and the variables. The datasets used are:
NSE—Tata Global Beverage.
Significant Earthquakes, 1965–2016 [US Geological Survey].
Superstore Dataset—Tableau.
Fig. 2 Variables trend

Later, everything depends on the data processing, where unnecessary data is eliminated and irregular or missing data is edited or omitted. As explained in the previous sections, the preprocessing also involved setting the Date of each dataset as its index and sorting the records in ascending order. All of these actions took place in Python 3 with the help of libraries such as Pandas, NumPy, scikit-learn, and datetime. It is crucial to map the date to its ordinal value, as it plays an important role as the independent variable (datetime.toordinal() returns the day count of a particular date in standard format).
After that, it is necessary to decide which feature of each dataset fits best as the target variable against its Date variable. The stock prediction dataset consists of several potential variables named Open, High, Low, Last, Close, and Total Trade. Total Trade was never a perfect candidate for the target variable, as it does not show any consistent trend, as shown in Fig. 3.
On comparing the other features against Date, it turns out that the variables Open, High, Low, Last, and Close all follow the same trend, i.e., they overlap neatly on a line graph, as depicted in Fig. 2. So the team decided to pick any of these features as the target variable and went with Close. The relationship between Date and Close is depicted in Fig. 4.
Similarly, in the case of the Earthquake dataset, there were only two options for the target variable: Depth or Magnitude. The relation between Date/Year and Depth appeared neither categorical nor linear, whereas the variation of Magnitude against Date turns out to be categorical in nature and can be chosen as the target value for model building. Figures 5 and 6 depict the relationships between Date and Depth and between Date and Magnitude respectively.
In the Sales dataset, despite there being 20 variables, there were only two options for the target variable, Sales or Quantity, because Profit always seemed inconsistent. Further analysis showed that the relation between Quantity and Date is far more categorical than that between Sales and Date, so Quantity was chosen as the target variable for this phase; the plot in Fig. 7 speaks for itself.

Fig. 3 Total trade versus year

Fig. 4 Date and close relation

Finally, the team had a target variable for each dataset: Close, Magnitude, and Quantity turned out to be the target variables for the stock, earthquake, and sales forecasting datasets respectively.
Fig. 5 Depth versus date

Fig. 6 Magnitude versus date

Fig. 7 Sales dataset trends

The next part of the implementation process was the respective model building, and the first step in that was the initialization phase. The initialization phase was carried out for each dataset's model building individually in a file named 'initiate.py'. The role of this file is to set the Date variable as the index and map Date to its ordinal value, as it plays an important role as the independent variable (datetime.toordinal() returns the day count of a particular date in standard format). The 'initiate.py' file also splits the dataset into train and test sets, which are used in the development of the various models.
The requirements for this Python file include pandas, NumPy, and sklearn, which are eventually helpful in the later implementation. After this whole process, attention turns towards model building for the respective datasets. The process of model building is the same for all the datasets; the only change is the name of the dataset already imported in the respective initiate.py file. The file lr.py is used for the implementation of linear regression, and after model fitting, RMSE and MAPE were calculated and recorded. Similarly, the support vector machine was implemented in psvm.py by importing SVR from sklearn.svm.
K-nearest neighbor was implemented in knn.py, and it required importing the 'neighbors' module from sklearn to determine suitable data points. GridSearchCV in this file iterates over the predefined hyperparameters and fits the predicting model on the given training set, so the most suitable parameters can be chosen from the listed hyperparameters. 'neighbors.KNeighborsRegressor()' is used to create the KNN model, which is then fitted using GridSearchCV() based on the number of neighbors; a brief sketch is shown below. As with the previous models, RMSE and MAPE were calculated and recorded.
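A hedged sketch of that grid search; the parameter grid and the cross-validation setting are assumptions, as the paper does not list them.

from sklearn import neighbors
from sklearn.model_selection import GridSearchCV

# Search over the number of neighbours and keep the best regressor
params = {"n_neighbors": [2, 3, 4, 5, 6, 7, 8, 9]}
knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5)
model.fit(X_train, y_train)
y_pred_knn = model.predict(X_test)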
Auto ARIMA was implemented in pmd.py using the pmdarima library; its ndiffs utility performs a unit-root test to calculate the number of differences needed to make the time series stationary. The calculated ndiffs value is then used for model building in pmdarima.auto_arima(), and after forecasting, the team compared the results with the actual values to record the RMSE and MAPE. A Jupyter notebook was used for the implementation of the last model, Facebook's Prophet, or simply Prophet. Prophet is imported from the fbprophet library for this implementation, the model is fitted accordingly, and the evaluation is then performed.
After building the models and recording all the statistics, the next job was to build an interface that could display the results. The Django web framework was used for this phase of the implementation. The interface was named 'Nebula' and was built to display the data analysis and the results. As mentioned in the previous section, it consists of three different tabs: the landing page, datasets, and algorithms. The interface was built with the help of HTML, CSS, and Bootstrap, and it was the destination for the RMSE and MAPE values generated by the different models on the respective datasets. The upcoming sections discuss the comparative results and the conclusion.
The Django project was initiated by creating a project named Nebula, followed by an app called nova; no database was used in this interface. Under a new directory named templates (which was linked in the settings.py of the nebula directory) several HTML files were placed, and they are displayed on screen with the help of the 'views.py' file; a minimal sketch is given below. The static directory contains all the pictures, graphs, and CSS files that make the information attractive. In the end, the project Nebula was deployed after making some migrations, and the results wired into the HTML pages were now visible.
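As an illustration only (the paper does not list its view or template names), a minimal view module of the kind described could look like this:

# nova/views.py - hypothetical view functions for the three tabs of the Nebula interface
from django.shortcuts import render

def landing(request):
    # Landing page describing what Nebula is
    return render(request, "landing.html")

def datasets(request):
    # Pictorial row of datasets, each linked to its own analysis page
    return render(request, "datasets.html")

def algorithms(request):
    # Descriptions of the five algorithms compared in the system
    return render(request, "algorithms.html")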
As discussed, the results that needed to be generated were nothing other than the comparative values of the evaluation metrics for each dataset. First in the trail is the stock prediction dataset, and Table 1 shows the comparative values for it.

Table 1 Results of stock prediction dataset
Algorithms RMSE MAPE
Linear regression 47.51609 11.32705
K-nearest neighbor 65.11185 16.92529
Auto ARIMA 3.74366 0.72129
Prophet 53.01529 13.01318
Support vector machine 69.81082 12.44615

Table 2 Results of earthquake forecasting dataset
Algorithms RMSE MAPE
Linear regression 0.43306 2.49101
K-nearest neighbor 0.46377 2.86797
Auto ARIMA 0.41603 2.58689
Prophet 0.43047 2.71666
Support vector machine 0.43734 2.78535

Auto ARIMA is the best performer, with the lowest values of both RMSE and MAPE. However, SVM and KNN are the worst performers according to RMSE and MAPE respectively. Similarly, Table 2 shows the output generated for the earthquake dataset, and here the reader can observe that Auto ARIMA and linear regression are the best performers, with the lowest values of RMSE and MAPE respectively, while KNN is the worst performer according to both RMSE and MAPE; the numbers are, however, very close in this case.
The results of the sales forecasting dataset are described in Table 3, where it can be observed that linear regression and SVM turn out to be the best performers, with the lowest values of RMSE and MAPE respectively, while KNN is again the worst performer according to both RMSE and MAPE.
Table 4 depicts the ranking of each algorithm on the basis of the two evaluation metrics.
The graphs in Fig. 8 show the comparison of the values attained by the evaluation metrics. The Tata Global Beverage graph shows that RMSE has higher values than MAPE; however, the other two datasets say otherwise. Ultimately, it all depends on the target variable and the dataset.

Table 3 Results of sales forecasting dataset
Algorithms RMSE MAPE
Linear regression 2.22990 23.76444
K-nearest neighbor 2.35999 24.30888
Auto ARIMA 2.23614 23.97399
Prophet 2.24678 24.12586
Support vector machine 2.33276 22.56927

Table 4 Evaluation ranking (ranks 1–5 from best to worst for each metric)
NSE—Tata Global Beverage: RMSE: Auto ARIMA, LR, Prophet, KNN, SVM | MAPE: Auto ARIMA, LR, SVM, Prophet, KNN
Significant Earthquakes: RMSE: Auto ARIMA, Prophet, LR, SVM, KNN | MAPE: LR, Auto ARIMA, Prophet, SVM, KNN
Superstore: RMSE: LR, Auto ARIMA, Prophet, SVM, KNN | MAPE: SVM, LR, Auto ARIMA, Prophet, KNN

Fig. 8 Model performance comparative graph



The trend of the first dataset shows that Auto ARIMA has significantly lower values of RMSE (3.74366) and MAPE (0.72129) than the other models. As for the worst performers, KNN trails the other algorithms according to MAPE (16.92529), and SVM according to RMSE (69.81082). Looking at the trend of the second dataset, there is minimal difference between the models according to RMSE; among them all, Auto ARIMA (0.41603) gave a slightly better result. According to MAPE, however, linear regression (2.49101) came out on top, followed by Auto ARIMA (2.58689), while RMSE and MAPE both signified that KNN would not be a good choice for this dataset.
For the third dataset, i.e., sales prediction, choosing an optimal algorithm from the graph was very difficult. Nevertheless, linear regression proved more favorable than the others according to the RMSE figures (2.22990), while SVM was the more optimal algorithm according to MAPE (22.56927). But again, KNN was clearly not a good choice.
In the end, it would not be wrong to say that everything depends upon the trends and variables of the dataset, which is why choosing an appropriate machine learning model becomes a priority before pursuing a business idea. One can observe that there is only a small difference between the evaluation metric results for the earthquake and sales datasets, yet the numerical gap between Auto ARIMA and the other models on the stock prediction dataset is very large.

6 Conclusion

The main contribution of this chapter is to analyze well-known machine learning algorithms and to build a system through which the best-suited algorithm can be predicted as per the requirements. An experimental performance analysis of five algorithms, viz. linear regression, K-nearest neighbor, Auto ARIMA, Prophet, and support vector machine, is carried out, and stock market, earthquake, and sales forecasting data are analyzed. To compare the performance and accuracy of these algorithms, RMSE and MAPE are used as the evaluation metrics: the lower the values of RMSE and MAPE, the better the algorithm. As per the results, according to RMSE, Auto ARIMA is the most optimal algorithm in two cases out of three; MAPE, however, indicates that Auto ARIMA is suitable in only one case. Taking everything together, it can be said that Auto ARIMA edged out the other four algorithms, followed by linear regression in second place, while KNN is the worst choice for time series forecasting. In the end, it would not be wrong to say that everything depends upon the trends and variables of the dataset, which is why choosing an appropriate machine learning model becomes a priority before pursuing a business idea. One can observe that there is only a small difference between the evaluation metric results for the earthquake and sales datasets, yet the numerical gap between Auto ARIMA and the other models on the stock prediction dataset is clearly observed.
However, everything that is created or developed has room for improvement, which means there are several future scopes for this work. One can add other algorithms to the system or use other evaluation metrics, and other datasets can be used for further analysis. Using this kind of system, one can choose the most appropriate algorithm for their work, and it will be helpful for future research purposes as well.

References

1. Y. Ding, S. Han, Z. Tian et al., Review on occupancy detection and prediction in building simulation. Build. Simul. 15, 333–356 (2022). https://doi.org/10.1007/s12273-021-0813-8
2. V. Jatana, Machine Learning Algorithms (2019)
3. B. Abdualgalil, S. Abraham, Applications of machine learning algorithms and performance comparison: a review, in International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE), Vellore, India (2020), pp. 1–6. https://doi.org/10.1109/ic-ETITE47903.2020.490
4. A. Dasgupta, A. Nath, Classification of machine learning algorithms. Int. J. Innov. Res. Adv. Eng. (IJIRAE) 3, 6–11 (2016). ISSN: 2349-2763. https://doi.org/10.6084/M9.FIGSHARE.3504194.V1
5. P. Mondal, L. Shit, S. Goswami, Study of effectiveness of time series modeling (ARIMA) in forecasting stock prices. Int. J. Comp. Sci. Eng. Appl. 4, 13–29 (2014). https://doi.org/10.5121/ijcsea.2014.4202
6. K. Korjenić, K. Hodžić, D. Ðonko, Application of Facebook's Prophet algorithm for successful sales forecasting based on real-world data. Int. J. Comp. Sci. Inf. Tech. (IJCSIT) 12(2) (2020). https://doi.org/10.5121/ijcsit.2020.12203
7. S. Panigrahi, H.S. Behera, A study on leading machine learning techniques for high order fuzzy time series forecasting. Eng. Appl. Artif. Intell. 87, 103245 (2020)
8. M. Roondiwala, H. Patel, S. Varma, Predicting stock prices using LSTM. Int. J. Sci. Res. (IJSR) (2017)
9. B. Joosery, G. Deepa, Comparative analysis of time-series forecasting algorithms for stock price prediction (2020), pp. 1–6
10. A.A. Ariyo, A.O. Adewumi, C.K. Ayo, Stock price prediction using the ARIMA model, in UKSim-AMSS 16th International Conference on Computer Modelling and Simulation (Cambridge, 2014), pp. 106–112
11. G. Khekare, P. Verma, Prophetic probe of accidents in Indian smart cities using machine learning, in V. Bhateja, S.C. Satapathy, C.M. Travieso-González, V.N.M. Aradhya (eds.), Data Engineering and Intelligent Computing. Advances in Intelligent Systems and Computing, vol. 1407 (Springer, Singapore, 2021). https://doi.org/10.1007/978-981-16-0171-2_18
12. S.B. Imandoust, M. Bolandraftar, Application of K-nearest neighbor (KNN) approach for predicting economic events: theoretical background. Int. J. Eng. Res. Appl. 3(5), 605–661 (2013)
13. K. Ayyub, S. Iqbal, E.U. Munir, M.W. Nisar, M. Abbasi, Exploring diverse features for sentiment quantification using machine learning algorithms. IEEE Access 8, 142819–142831 (2020)
14. G. Khekare, Internet of everything (IoE): intelligence, cognition, catenate. MC Eng. Themes 1(2), 31–32 (2021)
15. Y. Zhang, Y.-M. Cheung, Learnable weighting of intra-attribute distances for categorical data clustering with nominal and ordinal attributes. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
