
Practical Time Series Forecasting with Python and SARIMAX

An Ultimate Guide to hands-on, practical time series implementation using Python and algorithms like SARIMAX

By

Aditya Kaushal
Copyright
Copyright © 2021 by Aditya Kaushal

All rights reserved. No part of this book may be reproduced, distributed, or
transmitted in any form or by any means, including photocopying, recording, or
any other electronic or mechanical methods, without the prior written
permission of the publisher, except in the case of brief quotations embodied
in critical reviews and certain other non-commercial uses permitted by
copyright law.

For feedback, media contact, omissions or errors regarding this book, please
contact the author at: [email protected]
Table of Contents
Preface
Prerequisites
What’s in it for you?
Let’s get started
Objective
Methodology/Tools Deployed
Dataset
Installing essential Libraries
Importing Raw Data
Visualizing Imported Dataset
Data Pre-processing
Checking for NULL values
Re-sampling the data from 30-second to 1 hour Interval
Decomposing the Time Series
What is Additive and Multiplicative?
What is Grid Search and AIC?
Train and Test
Afterword

Preface

This is a short book that shows readers how to build a time series model
using mathematical models, Python, and concepts from statistics to predict
real-time air quality in a locally mapped area, using open-source data from
OpenAQ sensors that measure air pollution metrics. OpenAQ currently
collects data in many different countries and primarily aggregates PM2.5,
PM10, ozone (O3), sulphur dioxide (SO2), and many more additional
metrics. The main objective of this book is to teach readers how to build
a Python project to forecast and monitor air pollution and track personal
exposure to PM 2.5.

Prerequisites

Readers are expected to have a basic understanding of, and hands-on
experience with, Python programming. You should be familiar
with the concept of time series forecasting, which in turn requires
beginner-level knowledge of statistics and mathematical concepts such as
averages and moving averages.

What’s in it for you?

Most articles and online blogs only show you how to forecast for a very
short period of time. Further, they often omit the connecting details that
help you relate the individual pieces of information and build a broader
understanding of the whole concept. This book's intent is to deliver the
most concise and accessible path to a forecast that is reliable and
effective for making good decisions based on the insights extracted.

By the end of the book, you will have a good understanding of the SARIMAX
algorithm and be able to make a good forecast of Particulate Matter 2.5
(PM 2.5), similar to what scikit-learn regression algorithms provide.
Let’s get started

This book dives straight into the implementation, with code snippets and
all the visualizations. You are expected to be familiar with Python, its
scientific libraries (NumPy, Pandas, Seaborn, Matplotlib), and other
miscellaneous libraries. Beyond that, it helps to have a good
understanding of mathematical concepts like moving averages and other
statistical concepts. For any questions and comments, feel free to drop me a
message at: [email protected]

Objective

The main objective of this book is to prototype a model suited to
forecasting PM 2.5. The aim is to help you understand and
develop a system that predicts future values of Particulate Matter 2.5 so
as to generate quality insights. This e-book helps
readers understand and analyse the various fluctuations in PM 2.5, and
gives an overview of mathematical tools like moving
averages, the Seasonal Auto Regressive Integrated Moving Average, time
series analysis, the Python programming language, and frameworks like
Flask.

Methodology/Tools Deployed

NumPy, Pandas, Matplotlib, Seaborn, time series forecasting algorithms
like SARIMAX, statistical components, Tableau, and Python will help you
gain practical exposure by implementing a full-fledged Flask web
application to forecast air quality.

Dataset
Before starting, we need to have historical and trustworthy data in hand to
make predictions. OpenAQ is an open-source non-profit organization
empowering communities around the globe to clean their air by sharing and
using open air quality data. We can download the raw data in CSV format
from their website.

Fig. RK Puram, New Delhi, India PM 2.5 Air Quality measurement

Fig. Different Air Quality metrics of the location


Fig. Selecting options
Fig. Select the metric you want in your dataset

Installing essential Libraries

After downloading the dataset we need to clean it, perform some data
wrangling and data imputation, fill null values, and make some data
corrections. We also need to install some Python libraries such as
NumPy and Pandas, as well as Matplotlib to get a visual
understanding of the data. Run the below mentioned command in your
terminal.
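Assuming a standard pip setup, the installation might look like the following (on older Python versions Prophet was published as "fbprophet"):

    pip install numpy pandas matplotlib statsmodels prophet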
1. NumPy is a library for the Python programming language, adding support for
large, multidimensional arrays and matrices, along with a large collection of high-
level mathematical functions to operate on these datasets.
2. Pandas helps you manipulate your data by facilitating operations such as selecting
or replacing columns and indices, or reshaping your data.
3. Matplotlib helps you visualize your data using different types of charts and plots.
4. Statsmodels is a Python package that allows users to explore data, estimate
statistical models, and perform statistical tests. An extensive list of descriptive
statistics, statistical tests, plotting functions, and result statistics is available for
different types of data and each estimator.
5. Prophet is a forecasting procedure implemented in Python and R. It is fast and
provides completely automated forecasts that can be tuned by hand by data
scientists and analysts.

That's all we need to get started; we can now import the data and begin
our model creation.

Importing Raw Data

In this section we will import all the libraries and the raw data.
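A minimal sketch of the imports used in the steps that follow (the aliases are common conventions):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm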
All the above libraries will help us preprocess the data and make it usable
for forecasting. The raw data contains lots of anomalies such as empty
rows, NULL values, and string data that is not usable by a time series
model.

Caveat #1: This book uses a different dataset which
cannot be shared with readers due to copyright and privacy reasons.
It is therefore recommended that readers download data from the
OpenAQ platform for their own project and follow along with this
book to create a forecasting model for their own purposes.

The next step is to import the data using the below mentioned snippet of
code. We utilise the Pandas library to convert the raw CSV file
into a Pandas data-frame so we can perform data pre-processing.
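A sketch of this step ('device41.csv' is the file name used in this book):

    # Read the raw CSV and parse the Datetime column into proper timestamps
    df = pd.read_csv('device41.csv', parse_dates=['Datetime'])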

The pandas.read_csv() function converts the CSV file into a data-
frame. Here, 'device41.csv' is the name of the CSV file. The dataset contains
PM 2.5 values alongside datetime values, and the datetime
needs to be converted into a proper format that can be deciphered by
Matplotlib and the Statsmodels statistical functions. Hence we have
used the parameter parse_dates=['Datetime'].

Caveat #2: The data downloaded from the OpenAQ platform will have
different columns. It is an exercise for the reader to manipulate
the data-frame into a shape suitable for forecasting. The
reader should find a way to keep only two columns, i.e., Datetime and
the PM 2.5 values.

Remove all the columns that you don't need in the dataset. This choice
is left to the reader's needs and purposes; it can be unique and does
not affect the overall objective of the project.

After importing the data, we can view the top five rows to verify that the
dataset is in our desired format.
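For example:

    # Inspect the first five rows to verify the import
    df.head()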
Fig. PM 2.5 data-frame

Visualizing Imported Dataset
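A sketch of the plotting step, assuming the two-column layout ('Datetime' and 'PM2.5') described above:

    # Plot PM 2.5 against time to inspect the overall fluctuations
    df.plot(x='Datetime', y='PM2.5', figsize=(12, 5))
    plt.xlabel('DateTime')
    plt.ylabel('PM 2.5')
    plt.show()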

Fig. PM2.5 plots.


The Matplotlib library helps us understand the overall fluctuations of
the PM 2.5 values in our dataset. In this plot, as we can see, the X-axis
represents the DateTime and the Y-axis represents the PM 2.5 values. In
this case the PM 2.5 values fluctuate a lot.

Before moving forward we need to understand seasonality and trends.

Seasonality: Seasonality is a characteristic of a time series in which
the data experiences frequent and predictable changes that recur every
year. Any predictable fluctuation or pattern that recurs or repeats over a
one-year period is said to be seasonal.

e.g. Ice cream sales are the best example of seasonality. Ice
cream sales are usually higher in summer than in
winter, so actual ice cream sales show a
predictable seasonal fluctuation. Seasonality is often
predictable.

Trends: The trend is the component of a time series that represents
low-frequency variations, with the high- and medium-frequency
fluctuations filtered out.

e.g. Trends can often be seen in the stock market. If a group of people
predicts that a particular stock is going to be profitable, the idea can spread
like wildfire. This can lead to an increase in buying of that particular stock,
and based on that speculation we would see a trend in the price of
the stock. Trends can be of many types, which is beyond the scope of this
book.

Fig. Seasonality and Trends.

Data Pre-processing

This step is usually the most important, and usually overlooked. Data
pre-processing determines the quality and reliability of
your forecasts. Make sure you spend enough time verifying that your
data looks pristine and reliable after pre-processing. This means the
dataset should have no NULL values, no string values that would be
fed as input to the model, no huge variations in the
numbers, and a datetime index that is absolutely consistent in its spacing.

The data pre-processing for this dataset has to pass certain quality
checks:
1. Consistency in the datetime values
2. Variation in the PM 2.5 values

The values should be broadly consistent, and all outliers should be removed
from the dataset. There can be certain days where the PM
2.5 values were very low or very high, for many possible reasons, but we
have to make sure that we remove all such outliers from the dataset.

We also need to resample the data over a time period, i.e., 15 minutes, 30
minutes, 1 hour, 2 hours, or any specific interval we wish. This is
because we can then specify the exact interval at
which the model forecasts the PM 2.5 values. If we want a forecast of
PM 2.5 every hour, then we need to resample the data to hourly intervals.

Checking for NULL values


For the model to consider the datetime, we need to make the
datetime column the index of the dataset.
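A sketch of both steps, checking for NULL values and setting the index:

    # Count the NULL values in each column
    print(df.isnull().sum())

    # Make the Datetime column the index of the data-frame
    df = df.set_index('Datetime')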

The set_index() function makes the DateTime column the index of
the data-frame. This is necessary because the model requires the index of the
dataset to be a datetime column.

Fig. DateTime index

Re-sampling the data from 30-second to 1 hour Interval
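A sketch of the resampling, assuming the indexed data-frame from the previous section:

    # Average all the 30-second readings within each hour;
    # in a notebook this displays the hourly series
    df.resample('1H').mean().fillna(method='ffill')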
The above snippet of code resamples the data: it takes the mean of all the
30-second readings that fall within each hour and treats that as the average
PM 2.5 value for that hour. In simpler terms, an hour contains 120
30-second readings, so the PM 2.5 values are summed 120 times and
divided by 120. This converts the data from a 30-second interval to an
hourly basis, and also helps reduce the size of the dataset to our needs.

The fillna() method fills any null values remaining after the
resampling. The 'ffill' argument requests a forward fill, which tells the
fillna() function to fill each null value with the last valid value
observed before it.

Fig. Dataset
We can save the newly resampled data into a new data-frame variable.
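For example:

    # Keep the hourly data in its own data-frame for the modelling steps
    df_hrs = df.resample('1H').mean().fillna(method='ffill')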

This saves all the resampled values inside the df_hrs variable.

Fig. Rolling Mean of PM2.5 values.

From the above image we can see that all the values have been resampled
and converted to an hourly basis. To inspect the data on an hourly
basis we need to slice it.

You can zoom into the data using the slice operator. The datetime column is
the index column, so you slice the index and then plot the sliced
range.
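For example (the dates here are illustrative):

    # Zoom into a single week by slicing the datetime index
    df_hrs.loc['2021-06-01':'2021-06-07'].plot(figsize=(12, 5))
    plt.show()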

Decomposing the Time Series


Any time series is composed of two main things:
1. Seasonality
2. Trends
With the help of the statsmodels library we can break the time series into its
seasonal pattern and trend. This will help us understand the data clearly
and make more sense of it.

Let's decompose the data using the statsmodels library.
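A sketch of this step using statsmodels' seasonal_decompose (the 'PM2.5' column name follows the assumption above):

    from statsmodels.tsa.seasonal import seasonal_decompose

    # Split the hourly series into trend, seasonal, and residual components
    decomposition = seasonal_decompose(df_hrs['PM2.5'], model='additive')
    decomposition.plot()
    plt.show()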

The above snippet of code decomposes the data into additive
seasonal and trend patterns. A time series can be an additive or a
multiplicative combination of its seasonal and trend components; I explain
additive and multiplicative time series below.

Fig. PM 2.5 peaks in the afternoon and is at its lowest in the morning. This
pattern can be identified as seasonality when forecasting air quality.
What is Additive and Multiplicative?

There are three components to a time series:


1. Trend: The trend tells you how things are changing overall.
2. Seasonality: Seasonality shows you how things change within a given
period, e.g., a year, month, week, or day.
3. Residual: The error/residual/irregular activity consists of the anomalies
that cannot be explained by the trend or the seasonal components.

In a multiplicative time series, the components multiply together to make
the time series. If you have an increasing trend, the amplitude of seasonal
activity increases; everything becomes more exaggerated. This is common
when you're looking at web traffic.

In an additive time series, the components add together to make the time
series. If you have an increasing trend, you still see roughly the same size
peaks and troughs throughout the time series. This is often seen in indexed
time series where the absolute value is growing but changes stay relative.
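In formula terms, writing the trend as T(t), the seasonality as S(t), and the residual as R(t), the two forms are:

    Additive:        value(t) = T(t) + S(t) + R(t)
    Multiplicative:  value(t) = T(t) * S(t) * R(t)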

The most common and recommended method used in time series
forecasting is the ARIMA model. In this book I use an
extended version of ARIMA known as the SARIMAX (Seasonal Auto
Regressive Integrated Moving Average with eXogenous factors) model.
The SARIMAX model is used when the dataset has seasonal cycles. In
datasets concerning air quality/PM 2.5 there is a seasonal pattern, which
I explained in the section above.

ARIMA is a model that can be fitted to time series data in order to better
understand or predict future points in the series.
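A sketch of how such a parameter search can be set up with statsmodels (the candidate ranges and the seasonal period of 12 are illustrative choices):

    import itertools

    # Candidate values for the non-seasonal (p, d, q) and seasonal orders
    p = d = q = range(0, 2)
    pdq = list(itertools.product(p, d, q))
    seasonal_pdq = [(x[0], x[1], x[2], 12) for x in pdq]

    # Fit a SARIMAX model for every combination and record its AIC
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(df_hrs['PM2.5'],
                                                order=param,
                                                seasonal_order=param_seasonal,
                                                enforce_stationarity=False,
                                                enforce_invertibility=False)
                results = mod.fit()
                print('ARIMA{}x{} - AIC:{}'.format(param, param_seasonal,
                                                   results.aic))
            except Exception:
                continue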

In the above snippet of code we find the right p, d, and q parameters
to correctly forecast the PM 2.5 values. These values are crucial
and have to be near ideal for reliable forecasts.

There are three distinct integers (p, d, q) that are used to parametrize
ARIMA models. Because of that, ARIMA models are denoted with the
notation ARIMA(p, d, q). Together these three parameters account for
seasonality, trend, and noise in datasets:
p is the auto-regressive part of the model. It allows us to incorporate the
effect of past values into our model. Intuitively, this would be similar to
stating that it is likely to be warm tomorrow if it has been warm the past
3 days.

d is the integrated part of the model. This includes terms in the model
that incorporate the amount of differencing (i.e. the number of past time
points to subtract from the current value) to apply to the time series.
Intuitively, this would be similar to stating that it is likely to be the same
temperature tomorrow if the difference in temperature over the last three
days has been very small.

q is the moving average part of the model. This allows us to set the error
of our model as a linear combination of the error values observed at
previous time points in the past.
Fig. AIC Grid Search Values
The above method is also known as the grid search method for finding the
right p, d, q values to give as input to the SARIMAX time
series model.

What is Grid Search and AIC?

Grid search is a tuning technique that attempts to compute the optimum
values of hyperparameters.

We have to find the lowest AIC value, whose corresponding p, d, q values
give the best forecast of the PM 2.5 values.

In my case the best model was ARIMA(1, 1, 1)x(0, 1, 1, 12) -
AIC: 1781.133929163659

The Akaike Information Criterion (AIC) is a widely used measure of a
statistical model. It quantifies (1) the goodness of fit and (2) the
simplicity/parsimony of the model in a single statistic.
When comparing two models, the one with the lower AIC is generally
"better".
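A sketch of the fitting step, plugging in the best order found by the grid search above:

    # Fit the model with the (p, d, q) values found by the grid search
    mod = sm.tsa.statespace.SARIMAX(df_hrs['PM2.5'],
                                    order=(1, 1, 1),
                                    seasonal_order=(0, 1, 1, 12),
                                    enforce_stationarity=False,
                                    enforce_invertibility=False)
    results = mod.fit()
    print(results.summary().tables[1])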
Fig. Summary Tables

The above snippet of code fits the dataset to the
SARIMAX model. As seen in the first line of the code, we use
the p, d, q values we found using the grid search method.

Then results = mod.fit() fits the model. The table above
shows all the statistical variables such as the z-scores, p-values and
standard errors.

Fig. Summary Plots.


Train and Test

The only step left is the verification/testing of the model we just
created. We have to split the data into train and test datasets. This will help
us verify that the results are somewhat reliable.

To split the data, it is recommended to use a 70:30 ratio: 70% of the
data is the training data, and 30% of the data is the testing data.
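A minimal sketch of such a split (chronological, since time series must not be shuffled):

    # Take the first 70% of the hourly series as train, the rest as test
    split_point = int(len(df_hrs) * 0.7)
    train, test = df_hrs.iloc[:split_point], df_hrs.iloc[split_point:]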

The training data is the dataset used to train the model; the model
is trained on the patterns/fluctuations existing in the training dataset.
The testing dataset, in contrast, is unseen data. The model has to
predict/forecast values based on the training data. If the forecasted data
overlaps the testing data values, then we can say the model can
predict/forecast reliable and trustworthy values.
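A sketch of this step (the start date is illustrative):

    # One-step-ahead predictions from the chosen start date onward
    pred = results.get_prediction(start=pd.to_datetime('2021-06-01'),
                                  dynamic=False)
    pred_ci = pred.conf_int()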

The above code gets the predicted values after we have created the
model. The .get_prediction() method returns predicted values
from the datetime you specify in the start parameter onward.
Fig. Forecasted and Observed values.
The above graph is clear evidence that the testing data and
training data (observed and predicted) overlap, as discussed in
the section above. This means that the forecasting model is overall
performing as it should.
The next step is to create a separate data-frame to help us
compare the true test data and the predicted values using the mean
squared error.
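A sketch, assuming the pred object from the previous step:

    # Compare the predictions against the held-out observations
    y_forecasted = pred.predicted_mean
    y_truth = df_hrs['PM2.5']['2021-06-01':]
    mse = ((y_forecasted - y_truth) ** 2).mean()
    print('Mean Squared Error: {}'.format(round(mse, 2)))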

The above snippet of code calculates the mean squared error.
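To produce the out-of-sample forecast itself, a sketch might look like:

    # Forecast the next 7 hourly values beyond the end of the data
    forecast = results.get_forecast(steps=7)
    forecast_ci = forecast.conf_int()
    print(forecast.predicted_mean)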

As we approach the end of the project, we can see that the
above code is responsible for forecasting the next 7 values. The variable
results contains the actual model information, as discussed
in the sections above.

The .get_forecast() method takes the information about the
model from the variable results and, based upon the observed
patterns, generates the required forecast values.
The final part is to create the actual plot, which brings this
long project together. As you can see in the plot, the blue line shows
the observed values and the orange line represents the values predicted
by the SARIMAX time series model. The shaded region shows the 95%
confidence interval.
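A sketch of that final plot:

    # Plot observed values against the forecast, with the shaded
    # 95% confidence interval around the forecast
    ax = df_hrs['PM2.5'].plot(label='Observed', figsize=(12, 5))
    forecast.predicted_mean.plot(ax=ax, label='Forecast')
    ax.fill_between(forecast_ci.index,
                    forecast_ci.iloc[:, 0],
                    forecast_ci.iloc[:, 1],
                    alpha=0.25)
    ax.set_xlabel('DateTime')
    ax.set_ylabel('PM 2.5')
    plt.legend()
    plt.show()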

Readers can even extend this project by creating a web app or mobile
application. The below charts can be plotted using various web plotting
libraries such as Plotly.

As I said earlier, with the help of this book you can learn how to
forecast over a significant horizon. But there are some caveats:
1. Air quality is subject to external factors that are uncontrollable
and natural, such as weather, wind speed, temperature, and pressure.
You also need to find the correlation between these variables. But
overall, the forecast can give you a general sense of how the values
will fluctuate.
2. The forecast reliability also depends upon the algorithm used.

You can set the steps parameter in the get_forecast(steps=...) method
to any desired value. But be careful and study the properties of the
values/metrics you are forecasting. Sometimes the values are only good
when forecasted up to a certain step.
Fig. Forecasted values

Afterword

This e-book is deliberately short: the purpose was to let readers finish
it in one sitting. I hope you have enjoyed the book. There are many other
methods out there, like Facebook Prophet, ARIMA, and even supervised
learning algorithms such as linear regression, and all of these can also
give you decent results. Time series data is everywhere around us. Once
again, thank you for the purchase. I hope this book helped you create the
project and gain a better understanding of the forecasting algorithms.
