An Overview of Practical Time Series Forecasting Using Python
and SARIMAX
By
Aditya Kaushal
Copyright
Copyright © 2021 by Aditya Kaushal
For feedback, media contact, omissions or errors regarding this book, please
contact the author at: [email protected]
Table of Contents
Preface
Prerequisites
What’s in it for you?
Let’s get started
Objective
Methodology/Tools Deployed:
Dataset
Installing essential Libraries
Importing Raw Data
Visualising Imported Dataset
Data Pre-processing
Checking for NULL values
Re-sampling the data from 30-second to 1 hour Interval
Decomposing the Time Series
What is Additive and Multiplicative?
What is Grid Search and AIC?
Train and Test
Afterword
Preface
This is a short book that shows readers how to build a time series model
using mathematical models, Python, and concepts of statistics to predict
real-time air quality in a locally mapped area, using open-source data from
OpenAQ sensors that measure air-pollution metrics. OpenAQ currently
collects data in many different countries and primarily aggregates PM2.5,
PM10, ozone (O3), sulphur dioxide (SO2), and many additional metrics.
The main objective of this book is to teach readers how to build a Python
project that forecasts and monitors air pollution to track personal
exposure to PM 2.5.
Prerequisites
Most articles and online blogs only show you how to forecast over a very
short period of time. Further, they often leave out the intermediate steps
that connect the pieces of information and give you a broader understanding
of the whole concept. This book's intent is to deliver a concise and
accessible way to produce a forecast that is reliable and effective for
making good decisions based on the insights extracted.
By the end of the book, you will have a good understanding of the SARIMAX
algorithm and be able to make a good forecast of Particulate Matter 2.5
(PM 2.5), similar to what Scikit-learn regression algorithms provide.
Let’s get started
This book dives straight into the implementation, with code snippets and
all the visualizations. It is a requirement to be familiar with Python, its
scientific libraries (NumPy, pandas, Seaborn, Matplotlib), and other
miscellaneous libraries. Beyond that, it is good to have a solid
understanding of mathematical concepts like moving averages and other
statistical concepts. For any questions and comments, feel free to drop me a
message at: [email protected]
Objective
Methodology/Tools Deployed
Dataset
Before starting, we need historical and trustworthy data in hand to make
predictions. OpenAQ is an open-source non-profit organization empowering
communities around the globe to clean their air by sharing and using open
air-quality data. We can download the raw data in CSV format from their
website.
That's all we need to start, and we can begin by importing the data and
creating our model.
In this section we will import all the libraries and the raw data.
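A minimal sketch of the imports this project relies on (the exact list is an assumption; statsmodels supplies the SARIMAX model used later):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import statsmodels.api as sm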
All the above libraries help to preprocess the data and make it usable for
forecasting. The raw data contains lots of anomalies, such as empty rows,
NULL values, and string data, that a time series model cannot use.
Caveat #1: In this book a different dataset is being used, which cannot be
shared with the readers due to copyright and privacy reasons. It is
therefore recommended that readers download the data from the OpenAQ
platform and follow along with this book to create a forecasting model for
their own purposes.
The next step is to import the data using the below-mentioned snippet of
code. We have utilised the pandas library to convert the raw CSV datasheet
into a pandas data-frame to perform data pre-processing.
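A sketch of that import step, based on the description that follows (the variable name df is an assumption):

    # Read the raw CSV and parse the 'Datetime' column as datetime objects
    df = pd.read_csv('device41.csv', parse_dates=['Datetime'])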
The pandas read_csv() function is used to convert the CSV file into a
data-frame. 'device41.csv' is the name of the CSV file. The dataset contains
the PM 2.5 values corresponding to the datetime values, i.e., the datetime
needs to be converted into a proper format that can be deciphered by
Python's Matplotlib and statsmodels statistical functions. Hence we have
used the parameter parse_dates=['Datetime'].
Caveat #2: The data downloaded from the OpenAQ platform will have different
columns. It is an exercise for the reader to manipulate the data-frame into
a form suitable for forecasting. The reader should find a way to keep only
two columns, i.e., Datetime and PM 2.5 values.
Remove all the columns that you don't need in the dataset. This choice is
left to the reader's needs and purposes; it can be unique and does not
affect the overall objective of the project.
After importing the data, we can view the top five rows to verify that the
dataset is in our desired format.
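For example, assuming the data-frame variable df from the import step:

    # Inspect the first five rows of the data-frame
    print(df.head())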
Fig. PM 2.5 data-frame
Two patterns worth understanding before modelling are seasonality and trend.
e.g. Ice-cream sales are the best example to explain seasonality. Ice-cream
sales are usually higher in the summer seasons compared to the winter
seasons, so based upon actual ice-cream sales there is a predictable
seasonal fluctuation. Seasonality is often predictable.
e.g. Trends can often be seen in the stock market. If a group of people
predicts that a particular stock is going to be profitable, that belief can
spread like wildfire, leading to an increase in buying of that particular
stock. Based upon that speculation we would see a trend in the price of the
stock. Trends can be of many types, which is beyond the scope of this book.
Data Pre-processing
This step is usually the most important and the most often overlooked. Data
pre-processing determines the quality and reliability of your forecasts.
Make sure you spend enough time verifying that your data looks pristine and
reliable after pre-processing. This means the dataset should have no NULL
values, no string values feeding into the model, no huge variations in the
numbers, and a datetime index that is absolutely consistent in its spacing.
The data pre-processing for this dataset has to pass certain quality checks:
1. Consistency in Datetime
2. Variation in PM 2.5 values
The values should be consistent overall, and all outliers should be removed
from the dataset. There can be certain days where the PM 2.5 values were
very low or very high, which can happen for many reasons, but we have to
make sure we remove all such outliers from the dataset.
We also need to resample the data over a time period, i.e. 15 minutes, 30
minutes, 1 hour, 2 hours, or any specific interval we wish. This is because
we can then specify the exact interval for which the model should forecast
the PM 2.5 values. If we want a PM 2.5 forecast every hour, then we need to
resample the data hourly.
The set_index() method makes the Datetime column the index of the
data-frame. This is necessary because the model requires the index of the
dataset to be a datetime column.
The fillna() method is used to fill any null values that remain after the
resampling is done. The 'ffill' (forward fill) argument tells fillna() to
replace each null value with the last valid value that precedes it.
Fig. Dataset
We can save the newly resampled data into a new data-frame variable.
This would save all the resampled values inside the df_hrs variable.
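Putting these steps together, a sketch of the resampling code (the '1H' rule assumes the hourly interval chosen above):

    # Make the Datetime column the index, then resample to hourly means
    df = df.set_index('Datetime')
    df_hrs = df.resample('1H').mean()

    # Forward-fill any null values introduced by the resampling
    df_hrs = df_hrs.fillna(method='ffill')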
From the above image we can see that all the values have been resampled and
converted to an hourly basis. To inspect the data over a specific period, we
need to slice it.
You can zoom into the data using the slice operator. The datetime column is
the index column, so you have to slice on the index and then plot the sliced
range.
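A sketch of such a slice (the date range is hypothetical; pick one that lies inside your dataset):

    # Slice one week of hourly values by indexing on the datetime index
    df_hrs['2021-01-01':'2021-01-07'].plot(figsize=(12, 4))
    plt.show()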
We can decompose the data into its seasonal and trend patterns, as shown in
the sketch below. A time series can be an additive or a multiplicative
combination of its seasonal and trend components; both cases are explained
in the next section.
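A sketch of the decomposition using statsmodels (model='additive' assumes the series is additive):

    from statsmodels.tsa.seasonal import seasonal_decompose

    # Split the series into trend, seasonal, and residual components
    decomposition = seasonal_decompose(df_hrs, model='additive')
    decomposition.plot()
    plt.show()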
Fig. PM 2.5 peaks in the afternoon and is at its lowest in the morning. This
pattern can be identified as seasonality when forecasting air quality.
What is Additive and Multiplicative?
In an additive time series, the components add together to make the time
series. If you have an increasing trend, you still see roughly the same size
peaks and troughs throughout the time series. This is often seen in indexed
time series where the absolute value is growing but changes stay relative.
In a multiplicative time series, the components multiply together, so as the
trend increases the seasonal peaks and troughs grow proportionally larger.
What is Grid Search and AIC?
ARIMA is a model that can be fitted to time series data in order to better
understand the series or to predict its future points.
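A sketch of such a grid search with itertools and statsmodels (the parameter ranges and the seasonal period of 24, a daily cycle in hourly data, are assumptions):

    import itertools

    # Candidate (p, d, q) combinations, each value in the range [0, 2)
    p = d = q = range(0, 2)
    pdq = list(itertools.product(p, d, q))
    seasonal_pdq = [(x[0], x[1], x[2], 24) for x in pdq]

    # Fit a SARIMAX model for every combination and record its AIC
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(df_hrs,
                                                order=param,
                                                seasonal_order=param_seasonal,
                                                enforce_stationarity=False,
                                                enforce_invertibility=False)
                results = mod.fit()
                print('SARIMAX{}x{} - AIC: {}'.format(param, param_seasonal,
                                                      results.aic))
            except Exception:
                continue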
In the above snippet of code we are finding the right p, d, and q parameters
to correctly forecast the PM 2.5 values. These values are crucial and have
to be near ideal for reliable forecasts.
There are three distinct integers (p, d, q) that are used to parametrize
ARIMA models. Because of that, ARIMA models are denoted with the
notation ARIMA(p, d, q). Together these three parameters account for
seasonality, trend, and noise in datasets:
p is the auto-regressive part of the model. It allows us to incorporate the
effect of past values into our model. Intuitively, this would be similar to
stating that it is likely to be warm tomorrow if it has been warm the past
3 days.
d is the integrated part of the model. This includes terms in the model
that incorporate the amount of differencing (i.e. the number of past time
points to subtract from the current value) to apply to the time series.
Intuitively, this would be similar to stating that tomorrow is likely to be
the same temperature if the difference in temperature over the last three
days has been very small.
q is the moving-average part of the model. This allows us to set the error
of our model as a linear combination of the error values observed at
previous time points.
Fig. AIC Grid Search Values
The above method is also known as the grid-search method for finding the
right p, d, q values to give as input to the SARIMAX time series model.
We have to find the lowest AIC value; the corresponding p, d, q values give
the best forecast of the PM 2.5 values.
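A sketch of the fitting step (the orders shown are placeholders; substitute the combination that gave the lowest AIC in your grid search):

    # Fit SARIMAX using the best (p, d, q) and seasonal orders found above
    mod = sm.tsa.statespace.SARIMAX(df_hrs,
                                    order=(1, 1, 1),
                                    seasonal_order=(1, 1, 1, 24),
                                    enforce_stationarity=False,
                                    enforce_invertibility=False)
    results = mod.fit()

    # Print the coefficient table with z-scores, P values, and standard errors
    print(results.summary().tables[1])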
The above snippet of code fits the dataset to the SARIMAX model. As seen in
the first line, we have used the p, d, q values that we found with the
grid-search method.
Then results = mod.fit() is used to fit the model. Printing the summary
shows all the statistical variables, such as the z-scores, P values, and
standard errors.
The only step left is the verification/testing of the model we just created.
We have to split the data into a train and a test dataset. This will help us
verify that the results are reliable.
The training data is the dataset used to train the model. The model will be
trained on the patterns/fluctuations in the training dataset, whereas the
testing dataset is unseen data. The model has to predict/forecast values
based on the training data. If the forecasted data overlaps the testing data
values, then we can say that the forecast produces reliable and trustworthy
values.
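A sketch of that step (the split date '2021-06-01', the plot start date, and the 'PM2.5' column name are hypothetical; adjust them to your dataset):

    # Predict from the split date onward and keep the confidence intervals
    pred = results.get_prediction(start=pd.to_datetime('2021-06-01'),
                                  dynamic=False)
    pred_ci = pred.conf_int()

    # Plot observed values against the one-step-ahead predictions
    ax = df_hrs['2021-05-01':]['PM2.5'].plot(label='observed', figsize=(12, 4))
    pred.predicted_mean.plot(ax=ax, label='predicted', alpha=0.7)
    ax.fill_between(pred_ci.index, pred_ci.iloc[:, 0], pred_ci.iloc[:, 1],
                    alpha=0.2)
    ax.legend()
    plt.show()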
The above code is used to get the predicted values after we have created the
model. The .get_prediction() method returns the predicted values from the
datetime given in the start parameter onwards.
Fig. Forecasted and Observed values.
The above graph is clear evidence that the observed and predicted values
overlap, as discussed in the previous section. This means the forecasting
model is performing as it should.
The next step is to create a separate data-frame that helps compare the true
test data with the predicted values using the mean squared error.
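A sketch of that comparison (the 'PM2.5' column name and the split date carry over the earlier assumptions):

    # Align the predicted values with the true test values
    y_predicted = pred.predicted_mean
    y_truth = df_hrs['2021-06-01':]['PM2.5']

    # Mean squared error: average of the squared differences
    mse = ((y_predicted - y_truth) ** 2).mean()
    print('Mean squared error of the forecast: {:.2f}'.format(mse))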
The above snippet of code calculates the mean squared error.
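A sketch of the forecasting step using statsmodels' get_forecast (steps=7 yields the next 7 hourly values):

    # Forecast 7 steps beyond the end of the observed data
    pred_uc = results.get_forecast(steps=7)
    print(pred_uc.predicted_mean)
    print(pred_uc.conf_int())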
As we approach the end of the project, the above code forecasts/predicts the
next 7 values. The variable results contains the fitted model, as discussed
in the sections above.
The readers can even extend this project by creating a web app or mobile
application. The charts below can be plotted using various web plotting
libraries such as Plotly.
As I said earlier, with the help of this book you can learn how to forecast
over a significant horizon. But there are some caveats:
1. Air quality is subject to external factors which are uncontrollable and
natural, such as weather, wind speed, temperature, and pressure. You also
need to find out the correlation between these variables. But overall, the
forecast can give you a general sense of how the value would fluctuate.
2. The forecast reliability also depends upon the algorithm used.
You can alter the steps parameter in the get_forecast(steps=...) method to
any desired value. But be careful and study the properties of the
values/metrics you are forecasting. Sometimes the values are only good when
forecasted up to a certain number of steps.
Fig. Forecasted values
Afterword
This e-book is deliberately short; its purpose is to let readers finish it
in one sitting. I hope you have enjoyed the book. There are many other
methods out there, like Facebook Prophet, ARIMA, and even supervised
learning algorithms such as linear regression. All of these algorithms and
methods can also give decent results. Time series data is everywhere around
us. Once again, thank you for the purchase. I hope this book helped you
create the project and gain a better understanding of forecasting
algorithms.