Time Series Linear Models
• 1. Collection of data
• 2. Data preparation and missing/outlier treatment.
• 3. Data analysis and feature engineering: the data are analyzed to find hidden patterns, relations between variables, and so on.
• 4. Train the algorithm on training and validation data: the data are divided into three subsets (training, validation, and test data) so that the forecasts can be validated.
• 5. Test the algorithm on test data: once the model has shown good enough performance on the training and validation data, its performance is checked against unseen test data. If the performance is still good enough, we can proceed to the next and final step.
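The split described in step 4 can be sketched as follows. This is a minimal illustration, assuming a chronological 60/20/20 split on a simulated series; the proportions and the variable names (n_train, n_valid) are assumptions for the example, not values from the slides. For time series the split keeps the time order instead of sampling at random.

```r
# Illustrative chronological split into training, validation, and test sets.
# The 60/20/20 proportions are an assumption chosen for this example.
set.seed(1)
n <- 100
y <- cumsum(rnorm(n))            # simulated series standing in for real data

n_train <- floor(0.6 * n)        # first 60% of observations
n_valid <- floor(0.2 * n)        # next 20% of observations

train <- y[1:n_train]
valid <- y[(n_train + 1):(n_train + n_valid)]
test  <- y[(n_train + n_valid + 1):n]   # remaining observations, never used for tuning

# Sizes of the three subsets
c(train = length(train), valid = length(valid), test = length(test))
```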
Training, validation, and test data
Machine learning utilizes optimization for tuning all the parameters of various algorithms.
Hence, it is a good idea to know some basics about optimization.
Examples of Econometric and Machine Learning Models
• 1. Regression
• 2. Decision Tree
• 3. Random Forest
• 4. Support Vector Machine
• 5. Neural network
1. Regression
• Linear regression is a basic and commonly used type of predictive analysis. The overall idea of
regression is to examine two things:
• (1) Does a set of predictor variables do a good job in explaining an outcome (dependent) variable?
• (2) Which variables in particular are significant predictors of the outcome variable, and in what way do
they–indicated by the magnitude and sign of the beta estimates–impact the outcome variable?
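• These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables:
$$y = X\beta + u$$
where y is the dependent variable, X contains the independent variables, β is the vector of coefficients, and u is the error term.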
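As a small check on the regression model, the sketch below computes the ordinary least squares estimate from the normal equations, $\hat{\beta} = (X'X)^{-1}X'y$, and compares it with lm(). The simulated data and the true coefficients (1 and 2) are assumptions chosen for illustration.

```r
# OLS estimate computed from the normal equations, compared with lm().
set.seed(1)
n <- 200
x <- rnorm(n)
u <- rnorm(n)
y <- 1 + 2 * x + u               # illustrative true coefficients: intercept 1, slope 2

X <- cbind(1, x)                 # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) b = X'y

fit <- lm(y ~ x)
# The two columns should agree up to numerical precision
cbind(normal_eq = drop(beta_hat), lm = coef(fit))
```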
Dynamic regression model (Autoregressive (AR) model)
• The AR (AutoRegressive) model is a type of statistical model used for analyzing and forecasting time series data. In an AR model, the current value of a variable is expressed as a linear combination of its previous values, plus a random error term. The general form of an AR model of order p (AR(p)) is:
$$Y_t = \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \varepsilon_t = \sum_{j=1}^{p} \phi_j Y_{t-j} + \varepsilon_t$$
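To make the AR(p) form concrete, the following sketch simulates an AR(2) process and recovers its coefficients with base R. The coefficient values (0.6 and -0.3) and the sample size are assumptions chosen so that the process is stationary; they do not come from the slides.

```r
# Simulate an AR(2) process and estimate its coefficients.
set.seed(123)
phi <- c(0.6, -0.3)                        # illustrative stationary AR(2) coefficients
y <- arima.sim(model = list(ar = phi), n = 2000)

fit <- arima(y, order = c(2, 0, 0))        # AR(2) is ARIMA(2, 0, 0)
coef(fit)[c("ar1", "ar2")]                 # estimates should be close to phi

# Five-step-ahead forecast from the fitted model
fc <- predict(fit, n.ahead = 5)$pred
fc
```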
Dynamic regression model (Autoregressive Moving Average (ARMA) model)
• ARMA (AutoRegressive Moving Average) model is a combination of two types of time series
models: the AutoRegressive (AR) model and the Moving Average (MA) model. It is used for
analyzing and forecasting stationary time series data, capturing both the relationships between
past values of the series and the past forecast errors.
• The general form of an ARMA(p, q) model is:
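The equation from the slide is not available here. As a reference point, the standard textbook form of an ARMA(p, q) model, written without a constant and with coefficient symbols that are assumptions of this note, is:

```latex
Y_t = \sum_{i=1}^{p} \phi_i Y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t
```

Here the $\phi_i$ are the autoregressive coefficients on past values, the $\theta_j$ are the moving average coefficients on past forecast errors, and $\varepsilon_t$ is white noise. Setting q = 0 gives the AR(p) model, and setting p = 0 gives the MA(q) model.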
2. Decision Tree
• Decision trees are a popular model, used in operations research, strategic planning, and machine learning. Each square in the tree diagram is called a node; adding nodes generally lets the tree fit the training data more closely, although a very deep tree can overfit. The last nodes of the decision tree, where a decision is made, are called the leaves of the tree. Decision trees are intuitive and easy to build, but a single tree often falls short when it comes to accuracy.
3. Random Forest
• Random forests are an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. The model then selects the mode of the predictions of all the decision trees (for regression, it averages them). What’s the point of this? By relying on a “majority wins” model, it reduces the risk of error from an individual tree.
4. Support Vector Machine
• The support vector machine (SVM) is a supervised learning model with associated
learning algorithms that analyze data used for classification and regression
analysis.
• Let’s assume that there are two classes of data. A support vector machine will
find a hyperplane or a boundary between the two classes of data that maximizes
the margin between the two classes (see below). There are many planes that can
separate the two classes, but only one plane can maximize the margin or distance
between the classes.
4. Support Vector Machine : classification
[Figures: SVM classification examples, including decision boundaries under low regularization and high regularization]
4. Support Vector Machine : regression
$$f(x) = W\varphi(X) + b,$$
where W is the weight parameter, φ is a nonlinear transformation function, and b is the threshold or bias.
5.Neural Network
• An artificial neural network (ANN) is a network of artificial neurons that receive inputs, change their internal states according to the inputs, and then compute outputs based on the inputs and internal states. These artificial neurons have weights that can be modified by a process called learning. The ANN model can be written as
$$y_t = g^{O}\left(g^{I}\left(x_i w^{I} + b^{I}\right) w^{O} + b^{O}\right)$$
where $g^{O}$ and $g^{I}$ are the output and input activation functions, respectively; $y_t$ and $x_i$ are the output and input, respectively; $b^{I}$ and $b^{O}$ are the bias terms of the input and output layers, respectively; and $w^{I}$ and $w^{O}$ are the weight vectors between the input layer and the hidden layer, and between the hidden layer and the output layer, respectively.
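The ANN formula can be traced step by step with a small forward pass. This is only a sketch: the weights are random rather than learned, and the choices of 2 inputs, 3 hidden neurons, tanh as the input activation, and the identity as the output activation are assumptions for illustration.

```r
# Forward pass of a single-hidden-layer network:
#   y = gO( gI(x %*% wI + bI) %*% wO + bO )
set.seed(1)
x  <- matrix(rnorm(5 * 2), nrow = 5)   # 5 observations, 2 inputs
wI <- matrix(rnorm(2 * 3), nrow = 2)   # input-to-hidden weights (3 hidden neurons)
bI <- rnorm(3)                         # one bias per hidden neuron
wO <- matrix(rnorm(3 * 1), nrow = 3)   # hidden-to-output weights
bO <- rnorm(1)                         # output bias

gI <- tanh                             # input (hidden-layer) activation
gO <- identity                         # output activation, linear for regression

# sweep() adds the bias bI[k] to column k of x %*% wI
hidden <- gI(sweep(x %*% wI, 2, bI, "+"))
y_hat  <- gO(hidden %*% wO + bO)
dim(y_hat)                             # one output per observation
```

Learning would then adjust wI, bI, wO, and bO to reduce the prediction error; that step is not shown here.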
• Neural Network Architecture
Let's Practice
1. Regression
# Simulation
set.seed(1)
e=rnorm(100)
x=rnorm(100)
y=1+2*x+e
# Estimation
linear=lm(y~x)
summary(linear)
# Prediction
pred=predict(linear)
plot(ts(y), col="blue", lwd=2, lty=2)
lines(pred, col="red", lwd=2)
legend("bottomleft", legend=c("Pred", "True"),col=c("red", "blue"), lty=1:2, cex=1)
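1. Regression : In-sample Predict
[Figure: true series ts(y) (blue, dashed) and predicted values (red) plotted against Time, 0 to 100. Because predict(linear) is called without new data, the predictions are in-sample fitted values.]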
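1.1 ARIMA and SARIMA
library(forecast)
# ARIMA(0,1,1) with no seasonal part
arima1 = Arima(y, order=c(0,1,1), seasonal=list(order=c(0,0,0), period=1))
# SARIMA(0,1,1)(1,1,0); period=1 is kept from the example, but seasonal data
# need the actual seasonal period here (e.g. 12 for monthly data)
sarima1 = Arima(y, order=c(0,1,1), seasonal=list(order=c(1,1,0), period=1))
# In-sample fitted values
inpred=fitted(arima1)
# Out-of-sample prediction, 5 steps ahead
outpred=predict(sarima1, 5)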
[Figure: decision tree predictions (tree.pred) plotted against Index]
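The code that produced tree.pred is not included above. A minimal sketch of a regression tree on the same data might look like this, assuming the rpart package (the package used on the original slide is unknown) and the same predictors, seed, and training size as the other examples.

```r
# Regression tree sketch; rpart is an assumption, not necessarily the slide's package.
library(MASS)     # provides the Boston housing data
library(rpart)    # recursive partitioning trees

set.seed(101)
train <- sample(1:nrow(Boston), 300)

# Fit a regression tree for medv on crim and rm using the training rows
tree.boston <- rpart(medv ~ crim + rm, data = Boston[train, ], method = "anova")

# Out-of-sample prediction and comparison with the actual values
tree.pred <- predict(tree.boston, newdata = Boston[-train, ])
true <- Boston[-train, "medv"]
plot(tree.pred, lwd = 2)
points(true, col = "red", lwd = 2)

# Root mean squared error on the held-out rows
sqrt(mean((tree.pred - true)^2))
```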
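3. Random Forest
library(MASS)            # provides the Boston housing data
library(randomForest)
boston = Boston          # the MASS data set is named Boston; keep the lowercase name used below
# Training
set.seed(101)
train = sample(1:nrow(boston), 300)
rf.boston = randomForest(medv~crim+rm, data = boston, subset = train)
# Out-of-sample prediction and comparison with the actual values
pred = predict(rf.boston, boston[-train, c("crim","rm")])
true = boston[-train,"medv"]
plot(pred, lwd=2)
points(true, col="red", lwd=2)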
3. Random Forest : Out-of-sample Predict
[Figure: predicted values (pred) with actual values in red, plotted against Index]
4. Support Vector Machine : regression
library(e1071)
# Training
svmfit = svm(medv~ crim+rm, data = boston[train,], kernel = "linear", cost = 1, scale = FALSE)
print(svmfit)
# Out-of-sample prediction and compare with actual value
pred=predict(svmfit, boston[-train, c("crim","rm")])
true=boston[-train,"medv"]
plot(pred, lwd=2)
points(true, col="red", lwd=2)
4. Support Vector Machine : regression : Out-of-sample Predict
[Figure: predicted values (pred) with actual values in red, plotted against Index]
5. Neural network
library(neuralnet)
# One hidden layer with 2 neurons. The activation function is tanh
# Training
nn <- neuralnet(medv~crim+rm, data=boston[train,], hidden=c(2), act.fct =
"tanh",linear.output=TRUE, threshold=0.01)
nn$result.matrix
plot(nn)
[Figure: plot(nn) output showing the inputs crim and rm, one hidden layer with 2 tanh neurons, the output medv, and the estimated weights and biases]
5. Neural network : Out-of-sample Predict
[Figure: predicted values (pred) plotted against Index]
Understanding Temporal Relationships
• In addition to linear regression, several models can be used to understand temporal relationships:
• ECM
• ARDL
• Quantile regression
ARDL
• Instead of working with differences of Y and X when they have unit roots, you may wish to estimate the ARDL model. More lag terms can be added to the ARDL as
$$Y_t = \sum_{p=1}^{P} \beta_p Y_{t-p} + \sum_{q=1}^{Q} \gamma_q X_{t-q} + v_t$$
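An ARDL model with one lag of Y and one lag of X (P = Q = 1) can be estimated by ordinary least squares on lagged regressors. The sketch below uses simulated data; the true coefficients (0.5 and 0.8) and all variable names are assumptions for illustration. Note that lm() adds an intercept by default.

```r
# ARDL with P = Q = 1 estimated by OLS on simulated data.
set.seed(42)
n <- 500
x <- as.numeric(arima.sim(list(ar = 0.5), n = n))   # simulated regressor

# Generate y from its own lag and the lag of x (illustrative coefficients)
y <- numeric(n)
for (t in 2:n) {
  y[t] <- 0.5 * y[t - 1] + 0.8 * x[t - 1] + rnorm(1, sd = 0.5)
}

# Align each observation with its lagged values
d <- data.frame(y = y[-1], y_lag1 = y[-n], x_lag1 = x[-n])

fit <- lm(y ~ y_lag1 + x_lag1, data = d)
coef(fit)    # estimates should be near 0.5 (y_lag1) and 0.8 (x_lag1)
```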
Estimators of Panel regression
• Pooled Ordinary Least Squares (Pooled OLS): Assumes that there are no unique attributes of
individuals or time periods, and the data can be pooled without accounting for individual or time-
specific effects. This method ignores the panel structure.
$$Y_{it} = \alpha + \beta_1 X_{1,it} + \beta_2 X_{2,it} + \beta_3 X_{3,it} + \varepsilon_{it}$$
• Fixed Effects Model (FE): Accounts for individual-specific effects that may correlate with the
independent variables. The model assumes these effects are constant over time and focuses on
within-entity variation. The model is specified as:
$$Y_{it} = \beta_1 X_{1,it} + \beta_2 X_{2,it} + \beta_3 X_{3,it} + \alpha_i + \lambda_t + \varepsilon_{it}$$
• Random Effects Model (RE): Assumes that the individual-specific effects are random and
uncorrelated with the independent variables. The model is specified as:
$$Y_{it} = \alpha + \beta_1 X_{1,it} + \beta_2 X_{2,it} + \beta_3 X_{3,it} + u_i + \lambda_t + \varepsilon_{it}$$
• Hausman test, null hypothesis ($H_0$): The random effects model is appropriate (i.e., the individual effects are uncorrelated with the regressors).
• Hausman test, alternative hypothesis ($H_1$): The fixed effects model is appropriate (i.e., the individual effects are correlated with the regressors).
Programming
• R code
• Stata
• Data : Panel Data ICT 10 countries from 1990-2020
R code: Panel regression
library(plm)
# Step 1: Import data
data <- read.csv(file.choose(), header = TRUE)
head(data)
# Step 2: Convert the data to a panel data frame indexed by country and year
panel <- pdata.frame(data, index = c("country", "year"))
# Step 3: Run panel regressions; effect can be "individual", "time", or "twoways"
# 3.1 Fixed effects
fe <- plm(GINI ~ IU + FT, model = "within", effect = "individual", data = panel)
summary(fe)
# 3.2 Random effects
re <- plm(GINI ~ IU + FT, model = "random", effect = "individual", data = panel)
summary(re)
# 3.3 Pooled OLS
pool <- plm(GINI ~ IU + FT, model = "pooling", data = panel)
summary(pool)
R code: Hausman Test
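The code for this slide is not reproduced here. A minimal sketch of a Hausman test in R uses phtest() from the plm package to compare a fixed effects fit with a random effects fit. The example runs on the Grunfeld data shipped with plm, which is an assumption made so the code is self-contained; with the ICT panel, the fe and re objects estimated earlier would be passed instead.

```r
# Hausman test sketch on the Grunfeld example data (not the ICT panel).
library(plm)
data("Grunfeld", package = "plm")

# Fixed effects and random effects fits of the same specification
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")
re <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "random")

# H0: the random effects model is appropriate
ht <- phtest(fe, re)
ht
```

A small p-value leads to rejecting H0 in favor of the fixed effects model.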
STATA
• This program is highly popular for conducting Panel
Regression in the present day and offers more comprehensive
testing compared to EViews. Therefore, we can use the STATA
program for estimating Panel Regression.
• We will perform the following estimations:
5.1 Importing and setting up the data
5.2 Estimating the Fixed Effects model
5.3 Estimating the Random Effects model
5.4 Hausman Test.
STATA: STEP 1 Import data
[Screenshot: menu path for opening the data editor]
1) The window for entering data appears, similar to Excel. You can copy the data from the Excel file and paste it into this window.
2) Copy → Paste and select "Treat first row as variable names."
3) The data will appear as shown in the image.
STATA: STEP 1 Set up Panel data
[Screenshots: panel-data setup dialog. Select the column for id and the column for time (year), choose the frequency, and click OK.]
STATA: STEP 1 Set up complete
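STATA: STEP 2 Run Fixed effects
[Screenshots: estimation dialog. Select the dependent variable and the independent variables, then click OK.]
$$GINI_{it} = \beta_0 + \beta_1 IU_{it} + \beta_2 FT_{it} + \mu_i + \varepsilon_{it}$$
STATA: Fixed effects results
[Screenshot: fixed effects estimation output]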
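STATA: STEP 3 Run Random effects
Similar to Fixed effects
[Screenshots: estimation dialog. Select the dependent variable and the independent variables, then click OK.]
$$GINI_{it} = \beta_0 + \beta_1 IU_{it} + \beta_2 FT_{it} + \mu_i + \varepsilon_{it}$$
STATA: Random effects results
[Screenshot: random effects estimation output]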
STATA: Hausman Test (Stata Code)
Command
xtreg gini iu ft, fe
estimates store fix
xtreg gini iu ft, re
estimates store random
hausman fix random
STATA: Hausman Test Result
P-value = 0.000, so H0 is rejected and the fixed effects model is appropriate.
R-code : Check Unit root test
library(plm)
# Step 1: Import data
data <- read.csv(file.choose(), header = TRUE)
head(data)
# Step 2: Convert the data to a panel data frame indexed by country and year
panel <- pdata.frame(data, index = c("country", "year"))
# Step 3: Run the Levin-Lin-Chu test on each variable (GINI shown here)
LLC <- purtest(panel$GINI, test = "levinlin", lags = "AIC", pmax = 1)
summary(LLC)
STATA: check unit root test
[Screenshots: menu path and dialog for the Levin, Lin and Chu unit root test. Select the variable, tick the lag option, and click OK.]
Panel unit root test result
[Screenshot: panel unit root test output]
Dynamic Panel regression
• Panel or longitudinal data enables accounting for unobserved unit-
specific heterogeneity and modeling dynamic adjustment or feedback
processes.
• Instrumental Variables (IV) and Generalized Method of Moments
(GMM) are the predominant estimation techniques for handling
models with endogenous variables, particularly when dealing with
lagged dependent variables in short time horizons.
• The model takes form as
$$Y_{it} = \alpha + \beta_1 Y_{i,t-1} + \beta_2 X_{1,it} + \beta_3 X_{2,it} + \mu_i + \lambda_t + \varepsilon_{it}$$
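For readers working in R, a dynamic panel model of this kind can be sketched with pgmm() from the plm package. The example follows the well-known Arellano and Bond employment equation on the EmplUK data shipped with plm; the data set and the specification are assumptions made for illustration and are not the ICT panel used in these slides.

```r
# Two-step difference GMM (Arellano-Bond) sketch on the EmplUK example data.
library(plm)
data("EmplUK", package = "plm")

# Left of "|": the model, including two lags of the dependent variable.
# Right of "|": GMM instruments (lags 2 and deeper of log(emp)).
ab <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) +
             log(capital) + lag(log(output), 0:1) | lag(log(emp), 2:99),
           data = EmplUK, effect = "twoways", model = "twosteps")

# The summary reports the Sargan test and the tests for serial correlation
summary(ab, robust = FALSE)
```

The default transformation is first differencing (difference GMM); passing transformation = "ld" to pgmm() requests the system GMM estimator instead.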
Some Stata milestones
December 15, 2000: xtabond command for the Arellano and Bond
(1991) difference GMM (diff-GMM) estimation.
November 26, 2003: xtabond2 command for Arellano and Bover (1995)
and Blundell and Bond (1998) system GMM (sys-GMM) estimation.
June 25, 2007: xtdpdsys command is used for system-GMM
estimation. Both xtabond and xtdpdsys are built on the xtdpd
command, offering different approaches to dynamic panel data
estimation.
June 1, 2017: xtdpdgmm estimates a linear (dynamic) panel data model with the generalized method of moments (GMM). The main value added of the new command is that it allows combining the traditional linear moment conditions with the nonlinear moment conditions suggested by Ahn and Schmidt (1995) under the assumption of serially uncorrelated idiosyncratic errors.
Generalized method of moments (GMM)
STATA
• Data : Panel Data ICT 10 countries from 1990-2020
Select estimation
Arellano-Bond diff-GMM
Arellano and Bover (1995) and Blundell and
Bond (1998) system GMM (sys-GMM)
Workshop on real data application
Paper 1 : ARIMA and SARIMA
ARIMA forecasting of primary energy demand
by fuel in Turkey
• Study: This study uses the Autoregressive Integrated Moving Average (ARIMA) and seasonal ARIMA (SARIMA) methods to estimate the future primary energy demand of Turkey from 2005 to 2020.
• Data: Primary energy demand of Turkey between 1950 and 2004 (however, this example data covers only 1965-2005)
https://ourworldindata.org/energy/country/turkey
• Method : ARIMA and SARIMA
R code
# Load the required packages
library(readxl)
library(forecast)
# Read the data from sheet "paper1"
data <- read_excel(file.choose(), sheet = "paper1")
# Convert to a data frame
data <- data.frame(data)
# Display the first few rows of the imported data
head(data)
# Set time series data (annual, starting in 1965)
energy <- ts(data[, 2], start = 1965, frequency = 1)
arima1 <- Arima(energy, order = c(1,1,1), seasonal = list(order = c(0,0,0), period = 1))
sarima1 <- Arima(energy, order = c(1,1,1), seasonal = list(order = c(1,2,1), period = 1))
# In-sample fitted values
inpred <- fitted(arima1)
# Out-of-sample forecasts, 15 years ahead
outpred1 <- predict(arima1, 15)$pred
outpred2 <- predict(sarima1, 15)$pred
# Combine actual data with forecasted values
extended_energy <- ts(c(energy, outpred2), start = 1965, frequency = 1)
# Plot the actual data, then overlay both forecasts
plot(energy, xlim = c(1965, end(outpred1)[1]), ylim = range(energy, outpred1, outpred2), lwd = 2)
lines(outpred1, col = "red", lwd = 2)
lines(outpred2, col = "blue", lwd = 2)
# Add legend
legend("topleft", legend = c("Actual", "Out-of-sample Forecast arima", "Out-of-sample Forecast sarima"),
col = c("black", "red", "blue"), lty = c(1, 1, 1), lwd = 2, cex = 1)
Paper 2 : ARDL model
Economic growth and biomass energy
• Study: This paper investigates short-run and long-run causality between biomass energy consumption and economic growth in 10 selected developing and emerging countries, using the Autoregressive Distributed Lag (ARDL) bounds testing approach.
• Data: Argentina, Bolivia, Cuba, Costa Rica, El Salvador, Jamaica, Nicaragua, Panama, Paraguay and Peru. bc represents the logarithm of biomass energy consumption (bc_t), and py represents the logarithm of real GDP. Data were taken from the World Bank and the International Energy Agency, and cover 1980-2009.
# Bio energy consumption
Energy Statistics Data Browser – Data Tools – IEA
# REAL gdp
https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?locations=AR-BO
• Method : ARDL
Methodology
• The ARDL model follows the standard log-linear functional specification of the long-run relationship between the variables.
• The Error Correction model used to analyze relationships between the variables was constructed as follows:
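The equations from these slides are not available here. As a hedged sketch, the usual ARDL bounds testing setup in the style of Pesaran et al. (2001), written with the paper's variables py (log real GDP) and bc (log biomass energy consumption), takes the following form; the coefficient symbols and lag limits p and q are assumptions of this note.

```latex
% Unrestricted error correction form used for the bounds test
\Delta py_t = \alpha_0 + \sum_{i=1}^{p} \beta_i \Delta py_{t-i}
            + \sum_{j=0}^{q} \gamma_j \Delta bc_{t-j}
            + \lambda_1 py_{t-1} + \lambda_2 bc_{t-1} + \varepsilon_t

% Restricted error correction model with the lagged error correction term
\Delta py_t = \alpha_0 + \sum_{i=1}^{p} \beta_i \Delta py_{t-i}
            + \sum_{j=0}^{q} \gamma_j \Delta bc_{t-j}
            + \psi \, ECT_{t-1} + \varepsilon_t
```

The bounds F-test examines the null hypothesis of no long-run relationship, $\lambda_1 = \lambda_2 = 0$. In the second equation, $ECT_{t-1}$ is the lagged deviation from the long-run relationship, and a negative, significant $\psi$ indicates adjustment back toward it.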
ARDL R code (Argentina case)
# Load the packages
library(readxl)   # needed for read_excel()
library(ARDL)
library(urca)
# Read the data from sheet "paper2"
data <- read_excel(file.choose(), sheet = "paper2")
# Convert to a data frame
data <- data.frame(data)
# Display the first few rows of the imported data
head(data)
attach(data)
# Transform to logs
logBioArg=log(BioArgen)
logGDPArg=log(GDPArgen)
## Step 1 Unit root test I(0)
u1=ur.df(logBioArg,type="drift",selectlags = "AIC")
u2=ur.df(logGDPArg,type="drift",selectlags = "AIC")
summary(u1)
summary(u2)
## Unit root test I(1)
u3=ur.df(diff(logBioArg),type="drift",selectlags = "AIC")
u4=ur.df(diff(logGDPArg),type="drift",selectlags = "AIC")
summary(u3)
summary(u4)
## Step 2 Find the best ARDL order --------------------------------------------
data1=data.frame(data,logBioArg,logGDPArg)
model1 <- auto_ardl(logGDPArg~ logBioArg, data = data1,max_order = c(4,4))
model1$top_orders
ARDL code
## Step 3 Estimate best ARDL
ardl_argen <- ardl(logGDPArg~ logBioArg,data = data1, order = c(1,1))
# Step 4 Estimate the ARDL-ECM
uecm_argen <- uecm(ardl_argen)
summary(uecm_argen )
# Step 5 Estimate the ECM
recm_argen <- recm(ardl_argen , case=3)
summary(recm_argen )
# Step 6 Bounds test from Pesaran et al. (2001)
bounds_f_test(ardl_argen , case = 3)
# step 7 Long run
multipliers(ardl_argen, type = "lr", vcov_matrix = NULL)
Now let’s estimate the ARDL in which biomass is the dependent variable and GDP is the independent variable.
Paper 3 : Panel regression
• Study: This paper revisits the renewable energy-economic growth nexus in
seven European countries for the 34-year period of 1985–2018.
• Data: seven OECD countries in Europe (Germany, Italy, Netherlands, Poland, Spain, Turkey, and United Kingdom), spanning the period of 1985–2018. All variables are in logarithms.
• Renewable energy consumption (RE) and electricity generation shares are
derived from BP’s 2019 Statistical Review of World Energy data file. We
obtain the OECD Europe price indexes for coal and natural gas from the IEA
Energy Prices and Taxes database.
• The data for real GDP (Y) is acquired from the International Monetary
Fund. Finally, the fixed gross capital formation (K) and labour force (L) data
are from World Bank’s World Development Indicators databank.
• Method : Panel regression ( Pooled mean group)
Methodology
• Model
This separation maintains the full expression of the PMG regression model,
capturing both the error correction term (long-run relationship) and the short-
run dynamics
PMG results
Paper 4 : Dynamic panel model
Do shareholder coalitions affect agency
costs? Evidence from Italian-listed companies
Study : This study investigates the relationship between agency costs and
ownership structure for a sample of listed Italian companies to determine the
impact of shareholder coalitions on agency costs.
Data: Using a balanced panel dataset of 163 Italian firm-year observations for the
period 2002–2013
- available data on ownership structure for the entire study period; information acquired from the Consob (Commissione Nazionale per le Società e la Borsa, 2014) website and individual company reports on corporate governance.
- available data on firm-level indicators (debt-to-capital ratio, size, age of the firm, industry sector) for all companies in the sample. Data were collected from Datastream, Bloomberg, Calepino dell’Azionista (Mediobanca, 2014), and obtained manually from the financial statements of the individual companies.
Methodology : Dynamic panel data model estimated with a two-step system-GMM
Stata : DPD SYSTEM-GMM (2 steps)
Autocorrelation and Sargan test
Thank you